knowd
by Community
Personal knowledge base - save web pages and search them semantically. Use when user shares a URL to save/bookmark/remember, asks "what did I save about...", wants to search saved articles, or asks about their knowledge base. Supports multiple embedding providers (OpenAI, Voyage, Cohere, Jina, Ollama).
1.0.0
$ npx skills add https://github.com/ianpcook/knowd-skill
Files
README.md
2.3 KB
# knowd 🧠

Personal knowledge base for AI agents. Save web pages, search them semantically. No UI, just conversation.

Tell your agent "save this URL" and it extracts, chunks, embeds, and stores it. Ask "what did I save about X?" and it finds the most relevant passages.

## Features

- **Multi-provider embeddings**: OpenAI, Voyage AI, Cohere, Jina, or Ollama (local/free)
- **Semantic search**: cosine similarity over embedded chunks
- **Content extraction**: trafilatura handles the messy web
- **SQLite storage**: single file, no external services
- **Provider lock-in protection**: the DB tracks which provider was used; incompatible embeddings can't be mixed

## Install

### As an OpenClaw / Clawdbot skill

```bash
# From Skills N'at
# Visit https://skills-nat.vercel.app and install via repo URL

# Or manually
git clone https://github.com/ianpcook/knowd-skill.git skills/knowd
pip3 install -r skills/knowd/requirements.txt
```

### Standalone

```bash
git clone https://github.com/ianpcook/knowd-skill.git
cd knowd-skill
pip3 install -r requirements.txt
python3 scripts/knowd.py --help
```

## Setup

Set at least one embedding provider API key:

| Provider | Env Variable | Model |
|----------|--------------|-------|
| OpenAI (default) | `OPENAI_API_KEY` | text-embedding-3-small |
| Voyage AI | `VOYAGE_API_KEY` | voyage-3-lite |
| Cohere | `COHERE_API_KEY` | embed-v4 |
| Jina | `JINA_API_KEY` | jina-embeddings-v3 |
| Ollama (local) | (none) | nomic-embed-text |

## Usage

```bash
# Save a URL
python3 scripts/knowd.py save "https://example.com/article"

# Search your knowledge
python3 scripts/knowd.py search "machine learning best practices" -k 5

# List saved sources
python3 scripts/knowd.py list

# Stats
python3 scripts/knowd.py stats

# Delete
python3 scripts/knowd.py delete "https://example.com/article"

# List available providers
python3 scripts/knowd.py providers
```

## How it works

1. **Fetch**: downloads the page, extracts readable text via trafilatura
2. **Chunk**: splits into ~500-token overlapping chunks at sentence boundaries
3. **Embed**: sends chunks to your chosen embedding provider
4. **Store**: SQLite with binary embeddings (no vector DB dependency)
5. **Search**: embeds your query, computes cosine similarity against all chunks

## License

MIT
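Step 4 above is why knowd needs no vector database: embeddings are plain float32 arrays serialized into a SQLite BLOB column and scanned at query time, exactly as `scripts/knowd.py` does with `emb.tobytes()` and `np.frombuffer`. A minimal sketch of that round trip; the table and values here are illustrative, not knowd's actual schema:

```python
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (content TEXT, embedding BLOB)")

# Store: serialize a float32 vector into a BLOB
vec = np.array([0.1, 0.2, 0.3], dtype=np.float32)
conn.execute("INSERT INTO chunks VALUES (?, ?)", ("hello world", vec.tobytes()))

# Search: deserialize each BLOB and score it against the query vector
query = np.array([0.1, 0.2, 0.25], dtype=np.float32)
for content, blob in conn.execute("SELECT content, embedding FROM chunks"):
    emb = np.frombuffer(blob, dtype=np.float32)
    score = float(np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb) + 1e-10))
    print(content, round(score, 4))
```

A brute-force scan like this is linear in the number of chunks, which is typically fine at personal-collection scale.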
SKILL.md
3.1 KB
---
name: knowd
description: Personal knowledge base - save web pages and search them semantically. Use when user shares a URL to save/bookmark/remember, asks "what did I save about...", wants to search saved articles, or asks about their knowledge base. Supports multiple embedding providers (OpenAI, Voyage, Cohere, Jina, Ollama).
metadata: {"clawdbot": {"emoji": "🧠", "requires": {"bins": ["python3"]}, "primaryEnv": "OPENAI_API_KEY"}}
---
# knowd - Personal Knowledge Base
Save web pages, search them semantically. No UI, just conversation.
## Setup
Install dependencies:
```bash
pip3 install -r <skill>/requirements.txt
```
Set ONE of these API keys (or use Ollama for local/free):
- `OPENAI_API_KEY`: OpenAI text-embedding-3-small (default)
- `VOYAGE_API_KEY`: Voyage AI voyage-3-lite
- `COHERE_API_KEY`: Cohere embed-v4
- `JINA_API_KEY`: Jina jina-embeddings-v3
- Ollama: no key needed, just have Ollama running locally
The provider is locked in on the first save; you can't mix embedding spaces in the same DB.
## Commands
```bash
# Save a URL
python3 <skill>/scripts/knowd.py save "<url>"
# Save with a specific provider (only the first save sets the provider)
python3 <skill>/scripts/knowd.py --provider voyage save "<url>"
# Semantic search
python3 <skill>/scripts/knowd.py search "<query>" -k 5
# List saved sources
python3 <skill>/scripts/knowd.py list
# Stats (includes provider info)
python3 <skill>/scripts/knowd.py stats
# Delete a source
python3 <skill>/scripts/knowd.py delete "<url-or-id>"
# List available providers and which keys are set
python3 <skill>/scripts/knowd.py providers
```
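Every command accepts the global `--json` flag (defined on the CLI group in `scripts/knowd.py`), so an agent can drive the CLI programmatically instead of scraping formatted text. A hypothetical wrapper sketch; the `knowd` helper name and `skill_dir` default are assumptions:

```python
import json
import subprocess

def knowd(*args, skill_dir="skills/knowd"):
    """Run a knowd subcommand with --json and return parsed output (illustrative)."""
    out = subprocess.run(
        ["python3", f"{skill_dir}/scripts/knowd.py", "--json", *args],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        # search/list/stats print pure JSON on stdout
        return json.loads(out)
    except json.JSONDecodeError:
        # save prints progress lines first, then a single JSON line
        return json.loads(out.splitlines()[-1])

hits = knowd("search", "embedding providers", "-k", "3")
```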
## When to Use
### Saving
When user shares a URL with intent to save ("save this", "remember this", "bookmark this", "add to my knowledge base"):
1. Run `knowd save "<url>"`
2. Report: title, chunk count, provider used
3. Be conversational: "Saved! Got 8 chunks from 'Article Title' via openai."
### Searching
When user asks about saved knowledge ("what did I save about...", "find that article about...", "search my knowledge base for..."):
1. Run `knowd search "<query>"`; pass the global `--json` flag before the subcommand if you need structured output
2. Summarize results naturally: titles, relevant snippets, scores only if helpful
3. Don't dump raw output; synthesize
### Listing
When user asks what they've saved:
1. Run `knowd list`
2. Present as a clean list with titles and dates
### Provider Selection
- On first use, if user hasn't specified, auto-detect: use whichever API key is available in the environment
- If multiple keys exist, prefer OpenAI (most common)
- If user explicitly requests a provider: `--provider cohere`
- After first save, the provider is locked to that DB
## State
Database: `<workspace>/state/knowd.db` (SQLite)
The DB stores the embedding provider/model in metadata. Attempting to use a different provider on an existing DB will error with a clear explanation.
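A condensed sketch of that guard, assuming the `meta` table that `scripts/knowd.py` creates; the real logic is `resolve_provider` below, which raises a `click.ClickException` with remediation hints (`--db`, or delete and re-save):

```python
import sqlite3

def check_provider_lock(conn, requested):
    """Return the effective provider, refusing to mix embedding spaces (illustrative)."""
    row = conn.execute(
        "SELECT value FROM meta WHERE key = 'embedding_provider'"
    ).fetchone()
    locked = row[0] if row else None
    if locked and requested and requested != locked:
        raise SystemExit(
            f"This database uses '{locked}' embeddings; cannot switch to "
            f"'{requested}'. Use --db to create a separate database."
        )
    return locked or requested
```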
## Auto-Detection
If the user doesn't specify a provider, check environment variables in this order (a sketch follows the list):
1. OPENAI_API_KEY → openai
2. VOYAGE_API_KEY → voyage
3. COHERE_API_KEY → cohere
4. JINA_API_KEY → jina
5. Check if Ollama is running → ollama
6. If none of the above, error: no provider available
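A minimal sketch of that fallback chain. This detection is the agent's job: the script itself defaults to openai when no `--provider` is given. The `detect_provider` helper is illustrative; the Ollama probe uses the same base URL as the script's provider table:

```python
import os
import urllib.request

ORDER = [
    ("OPENAI_API_KEY", "openai"),
    ("VOYAGE_API_KEY", "voyage"),
    ("COHERE_API_KEY", "cohere"),
    ("JINA_API_KEY", "jina"),
]

def detect_provider():
    for env_var, name in ORDER:
        if os.environ.get(env_var):
            return name
    try:
        # Is a local Ollama server answering on its default port?
        urllib.request.urlopen("http://localhost:11434", timeout=2)
        return "ollama"
    except Exception:
        raise SystemExit("No embedding provider available.")
```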
requirements.txt
24 B
click
trafilatura
numpy
knowd.py
17.9 KB
#!/usr/bin/env python3
"""knowd - personal knowledge base with semantic search. Multi-provider embeddings."""
import hashlib
import json
import os
import re
import sqlite3
import sys
import urllib.request
import urllib.error
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse

import click
import numpy as np
import trafilatura

# Defaults
# Resolves to <workspace>/state/knowd.db when installed at <workspace>/skills/knowd/scripts/
DEFAULT_DB = Path(__file__).resolve().parent.parent.parent.parent / "state" / "knowd.db"
CHUNK_SIZE = 500  # ~tokens (approx 4 chars/token)
CHUNK_OVERLAP = 50

# --- Embedding Providers ---

PROVIDERS = {
    "openai": {
        "env": "OPENAI_API_KEY",
        "model": "text-embedding-3-small",
        "dims": 1536,
        "url": "https://api.openai.com/v1/embeddings",
    },
    "voyage": {
        "env": "VOYAGE_API_KEY",
        "model": "voyage-3-lite",
        "dims": 1024,
        "url": "https://api.voyageai.com/v1/embeddings",
    },
    "cohere": {
        "env": "COHERE_API_KEY",
        "model": "embed-v4",
        "dims": 1024,
        "url": "https://api.cohere.com/v2/embed",
    },
    "jina": {
        "env": "JINA_API_KEY",
        "model": "jina-embeddings-v3",
        "dims": 1024,
        "url": "https://api.jina.ai/v1/embeddings",
    },
    "ollama": {
        "env": None,
        "model": "nomic-embed-text",
        "dims": 768,
        "url": "http://localhost:11434/api/embed",
    },
}

def _http_post(url, headers, body, timeout=60):
    """Simple HTTP POST returning parsed JSON."""
    data = json.dumps(body).encode()
    req = urllib.request.Request(url, data=data, headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode())

def embed_texts(texts, provider, model, api_key=None):
    """Get embeddings from the specified provider. Returns list of np arrays."""
    if provider == "openai":
        return _embed_openai(texts, model, api_key)
    elif provider == "voyage":
        return _embed_voyage(texts, model, api_key)
    elif provider == "cohere":
        return _embed_cohere(texts, model, api_key)
    elif provider == "jina":
        return _embed_jina(texts, model, api_key)
    elif provider == "ollama":
        return _embed_ollama(texts, model)
    else:
        raise click.ClickException(f"Unknown provider: {provider}")

def _embed_openai(texts, model, api_key):
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    all_embs = []
    for i in range(0, len(texts), 2048):
        batch = texts[i:i + 2048]
        resp = _http_post(PROVIDERS["openai"]["url"], headers, {"model": model, "input": batch})
        for item in sorted(resp["data"], key=lambda x: x["index"]):
            all_embs.append(np.array(item["embedding"], dtype=np.float32))
    return all_embs

def _embed_voyage(texts, model, api_key):
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    all_embs = []
    for i in range(0, len(texts), 128):
        batch = texts[i:i + 128]
        resp = _http_post(PROVIDERS["voyage"]["url"], headers, {"model": model, "input": batch})
        for item in sorted(resp["data"], key=lambda x: x["index"]):
            all_embs.append(np.array(item["embedding"], dtype=np.float32))
    return all_embs

def _embed_cohere(texts, model, api_key):
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    all_embs = []
    for i in range(0, len(texts), 96):
        batch = texts[i:i + 96]
        # NOTE: input_type is always search_document, including when embedding queries
        resp = _http_post(
            PROVIDERS["cohere"]["url"], headers,
            {"model": model, "texts": batch, "input_type": "search_document", "embedding_types": ["float"]},
        )
        for emb in resp["embeddings"]["float"]:
            all_embs.append(np.array(emb, dtype=np.float32))
    return all_embs

def _embed_jina(texts, model, api_key):
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    all_embs = []
    for i in range(0, len(texts), 2048):
        batch = texts[i:i + 2048]
        resp = _http_post(PROVIDERS["jina"]["url"], headers, {"model": model, "input": batch})
        for item in sorted(resp["data"], key=lambda x: x["index"]):
            all_embs.append(np.array(item["embedding"], dtype=np.float32))
    return all_embs

def _embed_ollama(texts, model):
    all_embs = []
    for text in texts:  # one request per text; Ollama runs locally, so round trips are cheap
        resp = _http_post(PROVIDERS["ollama"]["url"], {"Content-Type": "application/json"},
                          {"model": model, "input": text})
        all_embs.append(np.array(resp["embeddings"][0], dtype=np.float32))
    return all_embs

# --- DB ---

def get_db(db_path):
    db_path = Path(db_path)
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path))
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA foreign_keys = ON")
    _init_db(conn)
    return conn

def _init_db(conn):
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT);
        CREATE TABLE IF NOT EXISTS sources (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE NOT NULL,
            title TEXT,
            domain TEXT,
            saved_at TEXT NOT NULL,
            content_hash TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source_id INTEGER NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
            content TEXT NOT NULL,
            embedding BLOB,
            chunk_index INTEGER NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_chunks_source ON chunks(source_id);
    """)
    conn.execute("INSERT OR IGNORE INTO meta VALUES ('schema_version', '1')")
    conn.commit()

def get_meta(conn, key):
    row = conn.execute("SELECT value FROM meta WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

def set_meta(conn, key, value):
    conn.execute("INSERT OR REPLACE INTO meta VALUES (?, ?)", (key, value))
    conn.commit()

def resolve_provider(conn, provider=None, model=None):
    """Resolve provider/model, enforcing DB lock-in after first use."""
    db_provider = get_meta(conn, "embedding_provider")
    db_model = get_meta(conn, "embedding_model")
    if db_provider:
        # DB already has a provider locked in
        if provider and provider != db_provider:
            raise click.ClickException(
                f"This database uses '{db_provider}' embeddings. "
                f"Cannot switch to '{provider}'; vectors would be incompatible. "
                f"Use --db to create a separate database, or delete and re-save all sources."
            )
        provider = db_provider
        model = model or db_model
    else:
        # First use: set provider
        provider = provider or "openai"
        if provider not in PROVIDERS:
            raise click.ClickException(
                f"Unknown provider: {provider}. Choose from: {', '.join(PROVIDERS.keys())}"
            )
        model = model or PROVIDERS[provider]["model"]
    # Validate API key
    pinfo = PROVIDERS[provider]
    api_key = None
    if pinfo["env"]:
        api_key = os.environ.get(pinfo["env"])
        if not api_key:
            raise click.ClickException(
                f"Provider '{provider}' requires {pinfo['env']} environment variable."
            )
    # Lock in on first use
    if not db_provider:
        set_meta(conn, "embedding_provider", provider)
        set_meta(conn, "embedding_model", model)
    return provider, model, api_key

# --- Content extraction ---

def fetch_content(url):
    """Extract text and title from a URL."""
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        # Fallback: raw urllib
        try:
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; knowd/1.0)"})
            with urllib.request.urlopen(req, timeout=30) as resp:
                downloaded = resp.read().decode("utf-8", errors="replace")
        except Exception:
            raise click.ClickException(f"Could not fetch: {url}")
    if not downloaded:
        raise click.ClickException(f"Could not fetch: {url}")
    text = trafilatura.extract(downloaded, include_comments=False, include_tables=True)
    title = None
    try:
        meta = trafilatura.metadata.extract_metadata(downloaded)
        if meta and meta.title:
            title = meta.title
    except Exception:
        pass
    if not text or len(text) < 50:
        raise click.ClickException(f"Could not extract meaningful content from: {url}")
    return text, title

# --- Chunking ---

def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into chunks of ~chunk_size tokens with sentence-boundary overlap."""
    chars = chunk_size * 4  # token budget converted to characters
    olap = overlap * 4
    sentences = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        parts = re.split(r"(?<=[.!?])\s+", line)
        sentences.extend(parts)
    chunks = []
    current = ""
    for s in sentences:
        if len(current) + len(s) + 1 > chars and current:
            # Flush the full chunk, carrying its tail into the next one for context
            chunks.append(current.strip())
            current = current[-olap:] + " " + s if olap else s
        else:
            current = current + " " + s if current else s
    if current.strip():
        chunks.append(current.strip())
    return chunks if chunks else [text[:chars]]

# --- Search ---

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

# --- CLI ---

@click.group()
@click.option("--db", default=None, help="Override DB path")
@click.option("--json-output", "--json", "json_out", is_flag=True, help="JSON output")
@click.option("--provider", "-p", default=None, help=f"Embedding provider: {', '.join(PROVIDERS.keys())}")
@click.option("--model", "-m", default=None, help="Override embedding model")
@click.pass_context
def cli(ctx, db, json_out, provider, model):
    """knowd - save and search your knowledge."""
    ctx.ensure_object(dict)
    ctx.obj["db_path"] = db or os.environ.get("KNOWD_DB", str(DEFAULT_DB))
    ctx.obj["json"] = json_out
    ctx.obj["provider"] = provider
    ctx.obj["model"] = model

@cli.command()
@click.pass_context
def init(ctx):
    """Initialize the database."""
    conn = get_db(ctx.obj["db_path"])
    conn.close()
    click.echo("Database initialized.")

@cli.command()
@click.argument("url")
@click.pass_context
def save(ctx, url):
    """Save a URL to the knowledge base."""
    conn = get_db(ctx.obj["db_path"])
    provider, model, api_key = resolve_provider(conn, ctx.obj["provider"], ctx.obj["model"])
    click.echo(f"Fetching {url}...")
    text, title = fetch_content(url)
    content_hash = hashlib.sha256(text.encode()).hexdigest()
    # Check for duplicate
    existing = conn.execute("SELECT id, content_hash FROM sources WHERE url = ?", (url,)).fetchone()
    if existing:
        if existing["content_hash"] == content_hash:
            if ctx.obj["json"]:
                click.echo(json.dumps({"status": "duplicate", "title": title, "url": url}))
            else:
                click.echo(f"Already saved (unchanged): {title or url}")
            conn.close()
            return
        conn.execute("DELETE FROM chunks WHERE source_id = ?", (existing["id"],))
        conn.execute(
            "UPDATE sources SET title=?, content_hash=?, saved_at=? WHERE id=?",
            (title, content_hash, datetime.now(timezone.utc).isoformat(), existing["id"]),
        )
        source_id = existing["id"]
        click.echo("Content updated, re-embedding...")
    else:
        domain = urlparse(url).netloc
        cur = conn.execute(
            "INSERT INTO sources (url, title, domain, saved_at, content_hash) VALUES (?,?,?,?,?)",
            (url, title, domain, datetime.now(timezone.utc).isoformat(), content_hash),
        )
        source_id = cur.lastrowid
    chunks = chunk_text(text)
    click.echo(f"Embedding {len(chunks)} chunks via {provider}/{model}...")
    embeddings = embed_texts(chunks, provider, model, api_key)
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
        conn.execute(
            "INSERT INTO chunks (source_id, content, embedding, chunk_index) VALUES (?,?,?,?)",
            (source_id, chunk, emb.tobytes(), i),
        )
    conn.commit()
    conn.close()
    if ctx.obj["json"]:
        click.echo(json.dumps({"status": "saved", "title": title, "url": url, "chunks": len(chunks), "provider": provider, "model": model}))
    else:
        click.echo(f"Saved: {title or url}")
        click.echo(f" {len(chunks)} chunks, {len(text)} chars ({provider}/{model})")

@cli.command()
@click.argument("query")
@click.option("-k", default=5, help="Number of results")
@click.pass_context
def search(ctx, query, k):
    """Semantic search over saved knowledge."""
    conn = get_db(ctx.obj["db_path"])
    provider, model, api_key = resolve_provider(conn, ctx.obj["provider"], ctx.obj["model"])
    query_emb = embed_texts([query], provider, model, api_key)[0]
    rows = conn.execute(
        """
        SELECT c.id, c.content, c.embedding, c.chunk_index,
               s.url, s.title, s.saved_at, s.domain
        FROM chunks c JOIN sources s ON c.source_id = s.id
        WHERE c.embedding IS NOT NULL
        """
    ).fetchall()
    if not rows:
        click.echo("No saved content to search.")
        conn.close()
        return
    results = []
    for r in rows:
        emb = np.frombuffer(r["embedding"], dtype=np.float32)
        sim = cosine_sim(query_emb, emb)
        results.append({
            "title": r["title"] or r["url"],
            "url": r["url"],
            "domain": r["domain"],
            "chunk": r["content"],
            "score": round(sim, 4),
            "saved_at": r["saved_at"],
        })
    results.sort(key=lambda x: x["score"], reverse=True)
    results = results[:k]
    conn.close()
    if ctx.obj["json"]:
        click.echo(json.dumps(results, indent=2))
    else:
        for i, r in enumerate(results, 1):
            preview = r["chunk"][:300] + "..." if len(r["chunk"]) > 300 else r["chunk"]
            click.echo(f"\n{'─' * 60}")
            click.echo(f" [{i}] {r['title']} (score: {r['score']:.3f})")
            click.echo(f" {r['url']}")
            click.echo(f" Saved: {r['saved_at'][:10]}")
            click.echo(f" {preview}")
        click.echo(f"\n{'─' * 60}")

@cli.command("list")
@click.option("--limit", default=20, help="Max sources to show")
@click.pass_context
def list_sources(ctx, limit):
    """List saved sources."""
    conn = get_db(ctx.obj["db_path"])
    rows = conn.execute(
        """
        SELECT s.id, s.url, s.title, s.domain, s.saved_at,
               COUNT(c.id) as chunk_count
        FROM sources s LEFT JOIN chunks c ON s.id = c.source_id
        GROUP BY s.id ORDER BY s.saved_at DESC LIMIT ?
        """,
        (limit,),
    ).fetchall()
    conn.close()
    if ctx.obj["json"]:
        click.echo(json.dumps([dict(r) for r in rows], indent=2))
    else:
        if not rows:
            click.echo("No saved sources.")
            return
        for r in rows:
            click.echo(f" [{r['id']}] {r['title'] or r['url']}")
            click.echo(f" {r['url']}")
            click.echo(f" {r['saved_at'][:10]} · {r['chunk_count']} chunks")
            click.echo()

@cli.command()
@click.pass_context
def stats(ctx):
    """Show database stats."""
    conn = get_db(ctx.obj["db_path"])
    sources = conn.execute("SELECT COUNT(*) FROM sources").fetchone()[0]
    chunks = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    provider = get_meta(conn, "embedding_provider") or "not set"
    model = get_meta(conn, "embedding_model") or "not set"
    conn.close()
    db_path = Path(ctx.obj["db_path"])
    db_size = db_path.stat().st_size if db_path.exists() else 0
    size_str = f"{db_size / 1024:.1f} KB" if db_size < 1048576 else f"{db_size / 1048576:.1f} MB"
    if ctx.obj["json"]:
        click.echo(json.dumps({"sources": sources, "chunks": chunks, "db_size_bytes": db_size, "provider": provider, "model": model}))
    else:
        click.echo(f" Sources: {sources}")
        click.echo(f" Chunks: {chunks}")
        click.echo(f" DB size: {size_str}")
        click.echo(f" Provider: {provider}")
        click.echo(f" Model: {model}")

@cli.command()
@click.argument("url_or_id")
@click.pass_context
def delete(ctx, url_or_id):
    """Delete a source and its chunks."""
    conn = get_db(ctx.obj["db_path"])
    try:
        sid = int(url_or_id)
        row = conn.execute("SELECT * FROM sources WHERE id = ?", (sid,)).fetchone()
    except ValueError:
        row = conn.execute("SELECT * FROM sources WHERE url = ?", (url_or_id,)).fetchone()
    if not row:
        raise click.ClickException(f"Source not found: {url_or_id}")
    conn.execute("DELETE FROM chunks WHERE source_id = ?", (row["id"],))
    conn.execute("DELETE FROM sources WHERE id = ?", (row["id"],))
    conn.commit()
    conn.close()
    if ctx.obj["json"]:
        click.echo(json.dumps({"status": "deleted", "title": row["title"], "url": row["url"]}))
    else:
        click.echo(f"Deleted: {row['title'] or row['url']}")

@cli.command()
@click.pass_context
def providers(ctx):
    """List available embedding providers."""
    if ctx.obj["json"]:
        click.echo(json.dumps({k: {"model": v["model"], "env": v["env"]} for k, v in PROVIDERS.items()}, indent=2))
    else:
        click.echo("Available embedding providers:\n")
        for name, info in PROVIDERS.items():
            env = info["env"] or "(none, local)"
            key_set = "✓" if (not info["env"] or os.environ.get(info["env"])) else "✗"
            click.echo(f" {name:10s} model: {info['model']:30s} env: {env:20s} [{key_set}]")
        click.echo()
        db_path = Path(ctx.obj["db_path"])
        if db_path.exists():
            conn = get_db(str(db_path))
            p = get_meta(conn, "embedding_provider")
            m = get_meta(conn, "embedding_model")
            conn.close()
            if p:
                click.echo(f" This database uses: {p}/{m}")

if __name__ == "__main__":
    cli()
skill.json
561 B
{
"name": "knowd",
"version": "1.0.0",
"description": "Personal knowledge base โ save web pages and search them semantically. Multi-provider embeddings (OpenAI, Voyage, Cohere, Jina, Ollama).",
"author": "Ian Cook",
"license": "MIT",
"category": "knowledge",
"agents": ["claude-code", "openclaw"],
"requires": {
"bins": ["python3"],
"env": {
"oneOf": ["OPENAI_API_KEY", "VOYAGE_API_KEY", "COHERE_API_KEY", "JINA_API_KEY"]
}
},
"keywords": ["knowledge-base", "semantic-search", "embeddings", "bookmarks", "web-clipper"]
}
Compatible Agents
Claude Code, Codex, OpenClaw, Antigravity, Gemini
Details
- Category: Uncategorized
- Version: 1.0.0
- Stars: 0
- Added: February 11, 2026
- Updated: February 11, 2026