|
We are developing a search engine. Our spider has come across many mirror sites that all redirect to the same parked domain. It is also possible that these sites are cloaking, that is, serving our spider different content from what a normal visitor sees.
These sites hide behind very similar domain names; the difference may be a single punctuation mark or letter. For example:
http://chinese-food--nutrition-facts.die...
http://chinese-food--nutrition-facts.nut...
http://chinese-food-nutrition-informatio...
http://chinese-food---nutrition.best-die...
These domains all redirect to a diet pill site.
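To make "very similar domain names" concrete, here is a rough sketch of one way a spider could flag such names: strip the punctuation from each hostname and compare what is left. The function names, the 0.9 threshold, and the `example.com` hostnames are all hypothetical and chosen only for illustration; the threshold in particular would need tuning against real crawl data.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def host_key(url: str) -> str:
    """Lowercase the hostname and drop everything except letters and digits."""
    host = urlsplit(url).hostname or ""
    return re.sub(r"[^a-z0-9]", "", host.lower())


def looks_like_variant(url_a: str, url_b: str, threshold: float = 0.9) -> bool:
    """Return True when two hostnames are nearly identical once punctuation is removed.

    The threshold is an assumed starting point, not a tested value.
    """
    a, b = host_key(url_a), host_key(url_b)
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Hypothetical hostnames that differ only by an extra hyphen.
print(looks_like_variant("http://cheap--widgets.example.com/",
                         "http://cheap-widgets.example.com/"))
```

A name match on its own proves nothing, since unrelated legitimate sites can have similar names; it is probably most useful as a cheap first filter before checking where the pages actually redirect.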
These parked domains usually contain advertising and would normally be booted off Google and Yahoo.
How can a spider detect cloaking or situations like these, so that we can give these sites a poor ranking or remove them from our index automatically?
How can we identify a mirror website?
Unfortunately, that would be hard to do without actually following every link and comparing page content. I'm sure Google's search engine has a big database to compare against.
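As a minimal sketch of what "comparing page content" could look like, the snippet below breaks each page's text into overlapping three-word shingles and scores two pages by Jaccard similarity. This is a common near-duplicate technique, not a description of how Google does it; the function names and the shingle size are assumptions made for illustration, and the input is assumed to be page text with the markup already stripped.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word sequences found in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short pages become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(text_a: str, text_b: str, k: int = 3) -> float:
    """Score two texts from 0.0 (no shared shingles) to 1.0 (identical shingle sets)."""
    sa, sb = shingles(text_a, k), shingles(text_b, k)
    if not sa and not sb:
        return 1.0  # two empty pages are treated as identical
    return len(sa & sb) / len(sa | sb)


page_a = "buy diet pills now online today"
page_b = "buy diet pills now online cheap"
print(jaccard(page_a, page_b))
```

Pages that score close to 1.0 against each other are mirror candidates. The same score could also hint at cloaking: fetch a URL once with the spider's user agent and once with a browser-like user agent, and treat a low score between the two responses as suspicious. Comparing every pair of pages this way does not scale, so a real index would presumably hash or fingerprint the shingles first.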
|