How to Fix “Indexed, though blocked by robots.txt” in GSC

“Indexed, though blocked by robots.txt” shows in Google Search Console (GSC) when Google has indexed URLs that it isn’t allowed to crawl.

In most cases, this is a straightforward issue where you’ve blocked crawling in your robots.txt file. However, a few other conditions can trigger the warning too, so let’s go through the following troubleshooting process to diagnose and fix the issue as efficiently as possible:

The first step is to ask yourself whether you want Google to index the URL.

If you don’t want the URL indexed…

Just add a noindex meta robots tag and make sure to allow crawling, assuming the page is canonical.
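For reference, a standard noindex meta robots tag sits in the page’s <head>; the X-Robots-Tag response header shown below it is the equivalent for non-HTML resources:

<!-- In the <head> of the page you don’t want indexed -->
<meta name="robots" content="noindex">

<!-- Or sent as an HTTP response header instead -->
X-Robots-Tag: noindex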

If you block a page from being crawled, Google may still index it, because crawling and indexing are two different things. Unless Google can crawl the page, it won’t see the noindex meta tag, so the page may get indexed anyway because it has links pointing to it.

If the URL canonicalizes to another page, don’t add a noindex meta robots tag. Just make sure the proper canonicalization signals are in place, including a canonical tag on the canonical page, and allow crawling so the signals pass through and consolidate correctly.
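As a reminder, a canonical tag is a link element in the <head> of the non-canonical URL pointing at the version you want indexed (domain.com is a placeholder here):

<!-- On the non-canonical URL -->
<link rel="canonical" href="https://domain.com/canonical-page/">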

If you want the URL indexed…

You need to figure out why Google can’t crawl the URL and remove the block.

The most likely cause is a crawl block in robots.txt, but there are a few other scenarios where you may see messages saying that you’re blocked. Let’s go through these in the order you should probably be looking for them:

  1. Look for a crawl block in robots.txt
  2. Check for intermittent blocks
  3. Look for a user-agent block
  4. Look for an IP block

Look for a crawl block in robots.txt

The easiest way to spot the problem is with the robots.txt tester in GSC, which identifies the blocking rule.

[Image: robots.txt tester in GSC highlighting the blocking rule]

If you know what you’re looking for, or you don’t have access to GSC, you can navigate to domain.com/robots.txt to find the file. We have more information in our robots.txt article, but you’re probably looking for a disallow statement like:

Disallow: /

There may be a specific user agent mentioned, or it may block everyone. If your website is new or recently launched, you may want to look for:

User-agent: *
Disallow: /

Can’t find a problem?

It’s possible that someone already removed the robots.txt block and resolved the issue before you got to it. That’s the best-case scenario. However, if the problem appears to be fixed but reappears shortly afterwards, you may have an intermittent block.

How to fix

You want to remove the disallow statement that is causing the block. How you do this depends on the technology used.
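As a simple sketch, if the block looks like the all-blocking example above, the fix is to delete the disallow line or scope it down; an empty Disallow value allows everything:

# Before: blocks all crawling
User-agent: *
Disallow: /

# After: allows crawling of the whole site
User-agent: *
Disallow: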

WordPress

If the problem affects your entire website, the most likely cause is that a WordPress setting that discourages search engines has been enabled. This mistake is common on new websites and after website migrations. Follow these steps to check:

  1. Click on “Settings”.
  2. Click on “Reading”.
  3. Make sure the “Search Engine Visibility” option is unchecked.

[Image: WordPress “Search Engine Visibility” setting]
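If you manage the site from the command line, the same setting can be checked with WP-CLI; the blog_public option is what backs the “Search Engine Visibility” checkbox (0 means search engines are discouraged):

# Check the current value (0 = discourage search engines, 1 = allow)
wp option get blog_public

# Allow search engines again
wp option update blog_public 1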

WordPress with Yoast

If you are using the Yoast SEO Plugin, you can edit the robots.txt file directly to remove the blocking instruction.

  1. Click on “Yoast SEO”.
  2. Click on “Tools”.
  3. Click on “File Editor”.

WordPress with Rank Math

Similar to Yoast, Rank Math lets you edit the robots.txt file directly.

  1. Click on “Rank Math”.
  2. Click on “General Settings”.
  3. Click on “Edit Robots.txt”.

FTP or hosting

If you have FTP access to the site, you can edit the robots.txt file directly to remove the disallow statement causing the problem. Alternatively, your hosting provider may give you access to a file manager with direct access to the robots.txt file.
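Before and after editing, it’s worth checking what the live file actually contains, for example from the command line (domain.com is a placeholder):

curl https://domain.com/robots.txt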

Check for intermittent blocks

It can be more difficult to troubleshoot intermittent issues because the conditions causing the block may not always be there.

I would recommend checking the history of your robots.txt file. For example, in the GSC robots.txt tester, clicking the drop-down list reveals previous versions of the file, which you can click to view their contents.

[Image: historical robots.txt versions in the GSC robots.txt tester]

The Wayback Machine on archive.org also keeps a history of the robots.txt files for the websites it crawls. You can click on any of the dates for which there is data and see what the file contained on that particular day.

[Image: robots.txt history in the Wayback Machine]
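If you prefer the command line, the Wayback Machine’s availability API can also point you to the archived robots.txt snapshot closest to a given date (timestamp format is YYYYMMDD; domain.com is a placeholder):

# Returns JSON describing the closest archived snapshot to the timestamp
curl "https://archive.org/wayback/available?url=domain.com/robots.txt&timestamp=20240101"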

Or, use the beta version of its Changes report, which makes it easy to spot content changes between two different versions.

[Image: comparing robots.txt versions with the Wayback Machine’s Changes report]

How to fix

The process for fixing intermittent blocks depends on what is causing the problem. One possible cause is a cache shared between a test environment and a live environment. When the cache from the test environment is active, the robots.txt file may contain a blocking directive; when the cache from the live environment is active, the site may be crawlable. In this case, you’ll want to split the cache or perhaps exclude .txt files from the cache in the test environment.
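As a rough sketch of that last idea, assuming an Nginx-based proxy cache in front of the site and a hypothetical upstream named backend, you could bypass the cache for robots.txt so the file is always fetched fresh:

# Hypothetical Nginx snippet: never serve robots.txt from the shared cache
location = /robots.txt {
    proxy_cache_bypass 1;   # skip any cached copy
    proxy_no_cache 1;       # don’t store a new one
    proxy_pass http://backend;
}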

Look for a user-agent block

User agent blocks are when a site blocks a specific user agent such as Googlebot or AhrefsBot. In other words, the site detects a particular bot and blocks the corresponding user agent.

If you can view a page in your normal browser but get blocked after changing your user agent, it means that the user agent you entered is blocked.

You can set a specific user agent using Chrome DevTools. Another option is to use a browser extension for changing user agents, like this one.

Alternatively, you can check for user-agent blocks with a cURL command. Here’s how on Windows:

  1. Press Windows + R to open the Run box.
  2. Type “cmd” and click “OK”.
  3. Enter a cURL command like this:

curl -A "user-agent-name-here" -Lv [URL]

curl -A "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)" -Lv https://ahrefs.com

How to fix

Unfortunately, this is another case where knowing how to fix it depends on where you find the block. Many different systems can block a bot, including .htaccess, the server configuration, firewalls, a CDN, or even something you can’t see that your hosting provider controls. Your best bet is to contact your hosting provider or CDN and ask them where the block is coming from and how to resolve it.

For example, here are two different ways to block a user agent in .htaccess that you might need to look for.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [F,L]

Or…

BrowserMatchNoCase "Googlebot" bots
Order Allow,Deny
Allow from ALL
Deny from env=bots

Look for an IP block

If you’ve confirmed that you’re not blocked by robots.txt and have ruled out user-agent blocks, it’s likely an IP block.

How to fix

IP blocks can be difficult to track down. As with user-agent blocks, your best bet is to contact your hosting provider or CDN and ask them where the block is coming from and how to resolve it.

Here is an example of something you might be looking for in .htaccess:

deny from 123.123.123.123
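Note that the line above uses the older Apache 2.2 syntax. On Apache 2.4+, the equivalent block (with a placeholder IP address) generally looks like this:

<RequireAll>
    Require all granted
    Require not ip 123.123.123.123
</RequireAll>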

Final thoughts

Most of the time, the “Indexed, though blocked by robots.txt” warning results from a robots.txt block. Hopefully, this guide has helped you find and fix the problem if that wasn’t the case for you.

Have any questions? Let me know on Twitter.
