How to block web crawlers
When you set your website domain as the Origin-Source, the copy of your website that is served from your personal domain name (CNAME) can be indexed by search engines.
To prevent this:
- Create a folder on the origin and add a robots.txt file with the following settings:
User-agent: *
Disallow: /
- Create a Rule for your CDN-resource with the following settings:
Match Type: Regular expression
Rule pattern: robots.*
Rewrite: /(.*) /folder/$1
Here, folder is the name of the folder that you created in the first step.
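Example: with this rule, a request for /robots.txt on the CDN-resource is rewritten to /folder/robots.txt on the origin.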
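The effect of the rule can be approximated with a short Python sketch. It is only an illustration: the function name rewrite_path is hypothetical, and it assumes the CDN applies the rule pattern anywhere in the request path and treats $1 as the first capture group, which may differ from the actual CDN behavior.

```python
import re

# Values copied from the rule settings above.
RULE_PATTERN = re.compile(r"robots.*")  # "Rule pattern"
REWRITE_FROM = re.compile(r"/(.*)")     # left part of "Rewrite"
REWRITE_TO = r"/folder/\1"              # right part; Python writes $1 as \1


def rewrite_path(path: str) -> str:
    """Return the path requested from the origin for a given CDN path.

    Assumption: the rule fires when the pattern is found anywhere in the path.
    """
    if RULE_PATTERN.search(path) is None:
        return path  # rule does not apply, path is left unchanged
    return REWRITE_FROM.sub(REWRITE_TO, path, count=1)


if __name__ == "__main__":
    for requested in ("/robots.txt", "/index.html", "/images/logo.png"):
        print(requested, "->", rewrite_path(requested))
```

Under these assumptions, only the robots.txt request is redirected to the folder; all other paths pass through unchanged.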
How does it work?
The robots.txt file tells search engine crawlers which parts of a site they are allowed to crawl.
The added rule rewrites the path of the robots.txt request made by web crawlers. For example, if your personal domain is cdn.domain.com, search engine crawlers will request the cdn.domain.com/robots.txt URL, and the CDN will answer with the file from the folder, which contains the indexing restrictions.
Consequently, the cdn.domain.com domain won't be indexed.
Note: These settings don't affect the website itself; only requests made through the CDN-resource are rewritten.
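To see how a standards-following crawler would interpret the robots.txt content from the first step, the rules can be evaluated with Python's standard urllib.robotparser module. This is a minimal offline sketch: the helper name is_blocked is hypothetical, cdn.domain.com is the example domain used above, and real search engines may apply their own additional logic.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the first step.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if robots_txt forbids the given agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the text directly, no network access
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("https://cdn.domain.com/", "https://cdn.domain.com/index.html"):
        print(url, "blocked:", is_blocked(ROBOTS_TXT, url))
```

To check a live CDN-resource instead, RobotFileParser also offers set_url() and read(), which download the robots.txt file before evaluating it.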