About the Ttek2 crawler

This page describes the bot identified by the user-agent string ttek2-bot/1.0 (+https://ttek2.com/about/crawler; contact@ttek2.com). It is what publishers will see in their access logs when this site indexes their content.
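For publishers who want to find these requests in their own logs, here is a minimal sketch in Python. It assumes the common "combined" access-log format, and the function name count_bot_hits is hypothetical, not something this site provides.

```python
import re
from collections import Counter

# Matches the product token of the user-agent, e.g. "ttek2-bot/1.0".
BOT_TOKEN = re.compile(r"\bttek2-bot/\d+(?:\.\d+)*", re.IGNORECASE)

# Matches the quoted request field, e.g. "GET /path HTTP/1.1".
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def count_bot_hits(log_lines):
    """Count requests per path made by ttek2-bot in combined-format log lines."""
    hits = Counter()
    for line in log_lines:
        if not BOT_TOKEN.search(line):
            continue  # request came from some other client
        match = REQUEST.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


if __name__ == "__main__":
    sample = [
        '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /news/a HTTP/1.1" '
        '200 512 "-" "ttek2-bot/1.0 (+https://ttek2.com/about/crawler; contact@ttek2.com)"',
    ]
    for path, count in count_bot_hits(sample).items():
        print(path, count)
```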

What it does

The crawler walks an explicit allowlist of ~80 technology-publication and forum hosts. It fetches HTML article pages, RSS/Atom feeds, and sitemaps. Extracted content is classified for tech relevance, deduplicated, and stored in a local SQLite/FTS5 index that powers the search and topics sections of this site. We do not republish article bodies; only titles, short snippets, and links back to the source are surfaced to readers.
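The deduplicate-then-index step can be sketched roughly as follows. This is an illustration in Python, not the site's actual code: the table names (seen, docs), the SHA-256 content key, and the choice to index only the title and snippet are all assumptions, and it requires a Python sqlite3 build with FTS5 enabled.

```python
import hashlib
import sqlite3


def content_key(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text, used to spot duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) dedup table and FTS5 index if missing."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs "
        "USING fts5(title, snippet, url UNINDEXED)"
    )
    return conn


def add_document(conn, title, body, url, snippet_words=30) -> bool:
    """Index a document; return False if an equivalent body was already seen."""
    try:
        conn.execute("INSERT INTO seen (key) VALUES (?)", (content_key(body),))
    except sqlite3.IntegrityError:
        return False  # duplicate content, skip it
    # Keep only a short snippet, never the full article body.
    snippet = " ".join(body.split()[:snippet_words])
    conn.execute(
        "INSERT INTO docs (title, snippet, url) VALUES (?, ?, ?)",
        (title, snippet, url),
    )
    conn.commit()
    return True


def search(conn, query):
    """Return (title, url) pairs matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT title, url FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return rows.fetchall()
```

Storing only the title, a snippet, and the URL mirrors what is surfaced to readers; a real index may well keep more internally.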

Etiquette

robots.txt is honored for the user-agent above and for the wildcard User-agent: *. Disallowed paths are not fetched, even if discovered through outlinks.
Crawl-delay directives override our default per-host pacing when they are higher.
Per-domain caps limit fetches to a configurable rolling 24-hour ceiling per host (default 200/day; lower for smaller sites).
HTTP 429 / 503 / 509 responses trigger an exponential backoff (5m → 15m → 1h → 4h → 24h) on the affected host, automatically lifted once the host responds normally again.
Conditional GET (If-None-Match / If-Modified-Since) is sent on every refetch, so unchanged pages return 304 and avoid bandwidth waste.
noindex and noai meta tags are obeyed; pages carrying either are dropped without indexing.
Paywalls / authentication walls are not bypassed. We do not solve CAPTCHAs, follow login redirects, or use shared credentials.
No JavaScript execution. Pages that require JS to render their main content are skipped or fetched via the publisher's RSS/sitemap if available.
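Three of the rules above (crawl-delay override, throttling backoff, conditional GET) reduce to small pure functions. The sketch below is in Python and is only an illustration. The function names are hypothetical, and resetting the backoff fully on the first normal response is an assumption about what "automatically lifted" means.

```python
from typing import Dict, Optional

DEFAULT_DELAY_S = 2.0                              # default per-host crawl delay
BACKOFF_STEPS_S = [300, 900, 3600, 14400, 86400]   # 5m, 15m, 1h, 4h, 24h
THROTTLE_STATUSES = {429, 503, 509}


def effective_delay(crawl_delay: Optional[float]) -> float:
    """A robots.txt Crawl-delay wins only when it is higher than the default."""
    if crawl_delay is None:
        return DEFAULT_DELAY_S
    return max(DEFAULT_DELAY_S, float(crawl_delay))


def next_backoff_level(level: int, status: int) -> int:
    """Step up one level on a throttling status; reset on anything else.

    Assumption: a single normal response lifts the backoff entirely.
    """
    if status in THROTTLE_STATUSES:
        return min(level + 1, len(BACKOFF_STEPS_S))
    return 0


def backoff_seconds(level: int) -> int:
    """Pause to apply to the host at a given backoff level (0 means none)."""
    return 0 if level == 0 else BACKOFF_STEPS_S[level - 1]


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build refetch headers from validators saved on the previous fetch."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

A host that keeps answering 429 would move through the pauses in order and then stay at the 24-hour step until it responds normally.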

How to opt out

If you would like your site removed from the index, you can:

Block via robots.txt. Add the following block; we will stop fetching within 24 hours of the directive going live:

User-agent: ttek2-bot
Disallow: /

Email us at contact@ttek2.com with the host(s) you want removed. Removal is permanent: we add the host to a deny-list so that future re-discovery via outlinks does not re-add it. Already-indexed documents from that host are removed in the next scheduled re-index pass (within 24 hours).

Specific URL takedowns are also handled at the same address — please include the URL(s).

If a request is urgent (e.g. content that should never have been published), put "urgent" in the subject line and we will process it the same day.
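For the robots.txt route, a publisher can sanity-check a rule before deploying it. The sketch below uses Python's standard urllib.robotparser. The helper name is_blocked is hypothetical, and the standard-library matcher only approximates how any particular crawler, this one included, interprets the file.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, agent: str = "ttek2-bot") -> bool:
    """Return True if the given robots.txt text disallows `url` for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = "User-agent: ttek2-bot\nDisallow: /\n"
    print(is_blocked(rules, "https://example.com/articles/1"))
```

Rules under User-agent: * apply to the bot as well, so a wildcard Disallow: / has the same effect for it.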

What we do NOT do

We do not train AI models on crawled article bodies.
We do not redistribute full-text or images of article bodies; only titles, short snippets (~30 words), and a link back to the source are exposed to readers.
We do not run third-party advertising on indexed content. The site has no ad network.
We do not sell access to the index. There is no public scraping API.
We do not impersonate browsers to evade anti-bot measures. The user-agent string above is what every fetch sends, with the following documented exception: for a small allowlist of hosts behind aggressive Cloudflare anti-bot rules (configured per host in crawler.json), the bot identifies as a stock Firefox UA. This is not done to bypass robots.txt; those hosts' robots.txt is still honored.

Volume and cadence

Average: 1,000–4,000 article fetches per day across the full allowlist.
Peak per-host: 200/day (can be configured lower per host).
Default per-host crawl delay: 2 seconds.
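A rolling 24-hour ceiling of this kind can be modeled with a per-host queue of fetch timestamps. The sketch below is in Python. The class name HostBudget and its caps argument are hypothetical, and the crawler's own bookkeeping may differ.

```python
from collections import defaultdict, deque
from typing import Dict, Optional

WINDOW_S = 24 * 60 * 60   # rolling window length in seconds
DEFAULT_CAP = 200         # default per-host ceiling per window


class HostBudget:
    """Tracks fetch timestamps per host and enforces a rolling 24-hour cap."""

    def __init__(self, caps: Optional[Dict[str, int]] = None):
        self.caps = caps or {}               # per-host overrides (lower for small sites)
        self.history = defaultdict(deque)    # host -> timestamps of recent fetches

    def try_fetch(self, host: str, now: float) -> bool:
        """Record a fetch at time `now` if the host is under its cap."""
        window = self.history[host]
        # Drop timestamps that have aged out of the 24-hour window.
        while window and now - window[0] >= WINDOW_S:
            window.popleft()
        if len(window) >= self.caps.get(host, DEFAULT_CAP):
            return False
        window.append(now)
        return True
```

With these figures, the 2-second delay is not the binding limit: 200 fetches spaced 2 seconds apart take about 400 seconds, so it is the daily cap that bounds per-host volume.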

Source code

The crawler is implemented in core/CrawlFrontier.php, core/HtmlFetcher.php, core/RobotsCache.php, and core/CrawlerPipeline.php. The orchestrator that schedules batches is admin/crawler-orchestrator.php. Our allowlist and per-host overrides live in content/config/crawler.json.

Contact

Removal / takedown / questions: contact@ttek2.com
Security: same address — please use "security" in the subject line.
