
Wilson Lin's demo search engine cuts SEO spam - but there is a catch

Reviewed: Andrii Daniv
2 min read
Aug 18, 2025

Software engineer Wilson Lin has released a demo search engine aimed at reducing SEO spam. A public demo and a full technical write-up accompany the release.

Wilson Lin’s prototype targets SEO spam by rethinking retrieval and indexing.

What he built

Over two months, Lin built a prototype that retrieves results using neural embeddings, chunking pages at the sentence level for precision. He trained a DistilBERT classifier to link each sentence to the sentences it depends on, so that every retrieved chunk carries the context needed to make sense of it.

"I would follow the 'chain' backwards to ensure all dependents were also provided in context."
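The chain-following step can be sketched roughly as below. This is a minimal illustration, not Lin's code: it assumes each sentence has at most one parent, and the `parents` list is a hypothetical stand-in for the classifier's output.

```python
from typing import Optional


def context_for(index: int, sentences: list[str],
                parents: list[Optional[int]]) -> list[str]:
    """Return the sentence at `index` plus every sentence it
    transitively depends on, in document order."""
    chain = []
    seen = set()
    current: Optional[int] = index
    # Walk the chain backwards until a sentence has no parent.
    # `seen` guards against accidental cycles in the predictions.
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = parents[current]
    return [sentences[i] for i in sorted(chain)]


sentences = [
    "RocksDB is an embedded key-value store.",
    "It is optimized for fast storage.",
    "PostgreSQL is a relational database.",
    "That makes it a good fit for write-heavy workloads.",
]
# Hypothetical classifier output: sentence 1 depends on 0, sentence 3 on 1.
parents = [None, 0, None, 1]

chunk = context_for(3, sentences, parents)
```

Here `chunk` contains sentences 0, 1, and 3, so the pronouns in the last sentence still resolve when the chunk is embedded or shown on its own.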

Main content extraction focused on HTML tags such as blockquote, dl, ol, p, pre, table, and ul.
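As a rough sketch of tag-based extraction, the example below uses Python's standard `html.parser` to keep only text that sits inside the listed tags. The class and function names are illustrative, and the code assumes well-formed HTML with explicit closing tags, which real pages often lack.

```python
from html.parser import HTMLParser

# The tags named in the write-up as carrying main content.
CONTENT_TAGS = {"blockquote", "dl", "ol", "p", "pre", "table", "ul"}


class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # how many content tags are currently open
        self.buffer = []    # text pieces of the block being read
        self.blocks = []    # finished text blocks

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Simplification: join pieces with spaces and collapse
                # whitespace, which also flattens <pre> formatting.
                text = " ".join(" ".join(self.buffer).split())
                if text:
                    self.blocks.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_blocks(html: str) -> list[str]:
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


page = ("<nav>Home | About</nav><p>First  paragraph.</p>"
        "<div>Sidebar ad</div><ul><li>One</li><li>Two</li></ul>")
blocks = extract_blocks(page)
```

Text in `nav` and bare `div` elements is dropped because neither tag is on the list, which is the point of the filter: navigation and sidebar text never reaches the index.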

Crawling and canonicalization

The crawler only fetched HTTPS URLs with valid eTLDs and hostnames, disallowing ports, usernames, and passwords in URLs. Canonicalization decoded and re-encoded components, normalized query parameters, and lowercased origins. Lin noted that DNS failures, very long URLs, and unusual characters caused downstream issues.
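The filtering and canonicalization rules might look something like the sketch below, built on `urllib.parse`. Several details are assumptions rather than Lin's rules: the eTLD check is a crude placeholder (a real one would consult the Public Suffix List), query parameters are normalized by sorting, and fragments are dropped.

```python
from typing import Optional
from urllib.parse import (parse_qsl, quote, unquote, urlencode,
                          urlsplit, urlunsplit)


def canonicalize(url: str) -> Optional[str]:
    """Return a canonical HTTPS URL, or None if the URL is rejected."""
    parts = urlsplit(url)
    if parts.scheme != "https":  # urlsplit lowercases the scheme
        return None
    try:
        port = parts.port
    except ValueError:  # non-numeric or out-of-range port
        return None
    # Disallow explicit ports, usernames, and passwords.
    if port is not None or parts.username is not None \
            or parts.password is not None:
        return None
    host = parts.hostname  # urlsplit lowercases the hostname
    # Placeholder for eTLD validation (assumption): require a dot.
    if not host or "." not in host:
        return None
    # Decode and re-encode the path so equivalent escapes compare equal.
    path = quote(unquote(parts.path), safe="/") or "/"
    # Assumed normalization: sort query parameters by name and value.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit(("https", host, path, query, ""))


canonical = canonicalize("HTTPS://Example.COM/a%20b?b=2&a=1#frag")
```

With these rules the mixed-case URL above canonicalizes to `https://example.com/a%20b?a=1&b=2`, while `http://` URLs and URLs carrying a port or credentials return `None`.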

Infrastructure and scale

Lin initially ran the infrastructure on Oracle Cloud, citing its 10 TB of free egress per month. As the system scaled, he moved from PostgreSQL to 64 RocksDB shards. At peak, ingestion reached about 200,000 writes per second across thousands of clients. For each page, the system stored raw HTML, normalized data, contextual chunks, hundreds of embeddings, and metadata.
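The article does not say how keys were assigned to shards. One common approach, shown here purely as an assumption, is to hash the key and take the result modulo the shard count. In-memory dictionaries stand in for the RocksDB instances, and all names are illustrative.

```python
import hashlib

NUM_SHARDS = 64  # shard count from the article


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Plain dicts stand in for the 64 RocksDB shards.
shards = [dict() for _ in range(NUM_SHARDS)]


def put(key: str, value: bytes) -> None:
    shards[shard_for(key)][key] = value


def get(key: str):
    return shards[shard_for(key)].get(key)


put("https://example.com/", b"<html>...</html>")
page = get("https://example.com/")
```

Because the hash is stable, every client computes the same shard for a given key without coordination, which matters when thousands of clients write concurrently.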

Embedding generation began with OpenAI’s API and later shifted to self-hosted inference on Runpod GPUs, including RTX 4090 instances. Lin said Runpod offered lower per-hour rates than AWS and Lambda, along with more stable networking.

Results

In tests with queries such as "best programming blogs" and with paragraph-length queries, Lin reported fewer spammy results than typical search engines return. The public demo and the technical breakdown are both available online.

Takeaways

Lin’s key lessons include the importance of index coverage for result quality and the difficulty of crawling and filtering at scale. He named coverage gaps as a constraint for independent engines and highlighted the challenge of automatically assessing trust, originality, and accuracy. In a future iteration, he would prioritize evaluation methods earlier. The architecture also changed as scale increased, moving from PostgreSQL to sharded RocksDB and from OpenAI’s API to self-hosted GPU inference.

Sources

  • Project write-up and technical details: here
  • Public demo site: here
Author
Etavrian AI
Etavrian AI is developed by Andrii Daniv to produce and optimize content for etavrian.com website.
Reviewed
Andrii Daniv
Andrii Daniv is the founder and owner of Etavrian, a performance-driven agency specializing in PPC and SEO services for B2B and e‑commerce businesses.