This page describes the bot identified by the user-agent string ttek2-bot/1.0 (+https://ttek2.com/about/crawler; contact@ttek2.com). It is what publishers will see in their access logs when this site indexes their content.
What it does
The crawler walks an explicit allowlist of ~80 technology-publication and forum hosts. It fetches HTML article pages, RSS/Atom feeds, and sitemaps. Extracted content is classified for tech relevance, deduplicated, and stored in a local SQLite/FTS5 index that powers the search and topics sections of this site. We do not republish article bodies; only titles, short snippets, and links back to the source are surfaced to readers.
Etiquette
- robots.txt is honored for the user-agent above and for the wildcard User-agent: *. Disallowed paths are not fetched, even if discovered through outlinks. Crawl-delay directives override our default per-host pacing when they are higher.
- Per-domain caps limit fetches to a configurable rolling 24-hour ceiling per host (default 200/day; lower for smaller sites).
- HTTP 429 / 503 / 509 responses trigger an exponential backoff (5m → 15m → 1h → 4h → 24h) on the affected host, automatically lifted once the host responds normally again.
- Conditional GET (If-None-Match / If-Modified-Since) is sent on every refetch, so unchanged pages return 304 and avoid bandwidth waste.
- noindex and noai meta tags are obeyed; pages carrying either are dropped without indexing.
- Paywalls / authentication walls are not bypassed. We do not solve CAPTCHAs, follow login redirects, or use shared credentials.
- No JavaScript execution. Pages that require JS to render their main content are skipped, or fetched via the publisher's RSS feed or sitemap where one is available.
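The backoff ladder and the conditional-GET headers described in the list above can be sketched in a few lines of Python. This is an illustration only, not the crawler's PHP source: the names HostBackoff, conditional_headers, BACKOFF_STEPS, and THROTTLE_STATUSES are hypothetical, and the sketch assumes that any response other than 429/503/509 counts as the host "responding normally".

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Backoff ladder from the list above, in seconds: 5m, 15m, 1h, 4h, 24h.
BACKOFF_STEPS = [5 * 60, 15 * 60, 60 * 60, 4 * 60 * 60, 24 * 60 * 60]
THROTTLE_STATUSES = {429, 503, 509}


@dataclass
class HostBackoff:
    """Hypothetical per-host backoff state; not the crawler's actual class."""
    level: int = 0              # consecutive throttled responses seen so far
    blocked_until: float = 0.0  # epoch seconds before which the host is skipped

    def record(self, status: int, now: float) -> None:
        if status in THROTTLE_STATUSES:
            # Climb one rung, staying on the last rung (24h) once it is reached.
            step = BACKOFF_STEPS[min(self.level, len(BACKOFF_STEPS) - 1)]
            self.level += 1
            self.blocked_until = now + step
        else:
            # Assumption: any other status counts as normal and lifts the backoff.
            self.level = 0
            self.blocked_until = 0.0

    def may_fetch(self, now: float) -> bool:
        return now >= self.blocked_until


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build revalidation headers from validators saved on the previous fetch."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

In this sketch a host that answers 429 at time t is skipped until t + 300 seconds, and a second throttled response extends the wait to 15 minutes from that point. A page fetched without an ETag or Last-Modified validator simply yields an empty header set, so the refetch is an ordinary GET.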
How to opt out
If you would like your site removed from the index, you can:
- Block via robots.txt. Add the following block — we will stop fetching within 24 hours of the directive being live:
User-agent: ttek2-bot
Disallow: /
- Email us at contact@ttek2.com with the host(s) you want removed. Removal is permanent: we add the host to a deny-list so that future re-discovery via outlinks does not re-add it. Already-indexed documents from that host are removed in the next scheduled re-index pass (within 24h).
- Specific URL takedowns are also handled at the same address — please include the URL(s).
If a request is urgent (e.g. content that should never have been published), put "urgent" in the subject line and we will process it the same day.
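Publishers who want to confirm that the robots.txt block above does what they intend can test it locally before deploying it. The sketch below uses Python's standard urllib.robotparser, which is not the parser the crawler itself uses, so treat the result as a sanity check rather than a guarantee. The example.com URL, the SomeOtherBot agent, and the extra wildcard group are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The opt-out block from this page, plus a hypothetical wildcard group
# that leaves the site open to other crawlers.
ROBOTS_TXT = """\
User-agent: ttek2-bot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Full user-agent string as it appears in access logs.
UA = "ttek2-bot/1.0 (+https://ttek2.com/about/crawler; contact@ttek2.com)"
PAGE = "https://example.com/articles/1"  # placeholder URL

blocked = not parser.can_fetch(UA, PAGE)
others_allowed = parser.can_fetch("SomeOtherBot/2.0", PAGE)

print("ttek2-bot blocked:", blocked)
print("other bots allowed:", others_allowed)
```

The standard-library parser matches on the product token before the first slash, so the User-agent: ttek2-bot group applies to the full versioned string.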
What we do NOT do
- We do not train AI models on crawled article bodies.
- We do not redistribute full-text or images of article bodies; only titles, short snippets (~30 words), and a link back to the source are exposed to readers.
- We do not run third-party advertising on indexed content. The site has no ad network.
- We do not sell access to the index. There is no public scraping API.
- We do not impersonate browsers to evade anti-bot measures. The user-agent string above is what every fetch sends, with one documented exception: for a small allowlist of hosts behind aggressive Cloudflare anti-bot rules (configured per host in crawler.json), the bot identifies as a stock Firefox UA. This is not done to bypass robots.txt; those hosts' robots.txt is still honored.
Volume and cadence
- Average: 1,000–4,000 article fetches per day across the full allowlist.
- Peak per-host: 200/day (configurable lower).
- Default per-host crawl delay: 2 seconds.
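To show how the figures above interact, here is a minimal Python sketch of per-host pacing that combines the 2-second default delay, a higher Crawl-delay when robots.txt supplies one, and the rolling 24-hour cap. It is an illustration under assumptions, not the logic in core/CrawlFrontier.php: the HostPacer class and its method names are hypothetical.

```python
from collections import deque
from typing import Deque, Optional

DEFAULT_DELAY_S = 2.0      # default per-host crawl delay stated above
DEFAULT_DAILY_CAP = 200    # default per-host ceiling stated above
WINDOW_S = 24 * 60 * 60    # rolling 24-hour window, in seconds


class HostPacer:
    """Hypothetical pacing state for one host; names are illustrative."""

    def __init__(self, daily_cap: int = DEFAULT_DAILY_CAP,
                 robots_crawl_delay: Optional[float] = None) -> None:
        self.daily_cap = daily_cap
        # A Crawl-delay directive wins only when it is higher than the default.
        self.delay = max(DEFAULT_DELAY_S, robots_crawl_delay or 0.0)
        self.fetch_times: Deque[float] = deque()

    def next_allowed(self, now: float) -> float:
        """Return the earliest time (epoch seconds) the next fetch may start."""
        # Forget fetches that have left the rolling window.
        while self.fetch_times and self.fetch_times[0] <= now - WINDOW_S:
            self.fetch_times.popleft()
        earliest = now
        if self.fetch_times:
            # Keep at least `delay` seconds between consecutive fetches.
            earliest = max(earliest, self.fetch_times[-1] + self.delay)
        if len(self.fetch_times) >= self.daily_cap:
            # Cap reached: wait until the oldest fetch in the window expires.
            earliest = max(earliest, self.fetch_times[0] + WINDOW_S)
        return earliest

    def record_fetch(self, when: float) -> None:
        self.fetch_times.append(when)
```

With the defaults, 200 fetches spaced 2 seconds apart take about 400 seconds, so on a busy host the daily cap, not the crawl delay, is the binding limit.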
Source code
The crawler is implemented in core/CrawlFrontier.php, core/HtmlFetcher.php, core/RobotsCache.php, and core/CrawlerPipeline.php. The orchestrator that schedules batches is admin/crawler-orchestrator.php. Our allowlist and per-host overrides live in content/config/crawler.json.
Contact
- Removal / takedown / questions: contact@ttek2.com
- Security: same address — please use "security" in the subject line.