crawlbin

crawlbin is designed for testing web crawlers against various search engine directives and HTTP responses.

Inspired by httpbin.

Introduction

crawlbin accepts a list of flags in a URL which toggle various search engine directives and HTTP responses in the complete HTML page it returns. This lets you simulate a variety of issues that crawlers might encounter.

For example:

http://crawlbin.com/response_404/
    HTTP Status: 404 (Not Found)

http://crawlbin.com/meta_noindex/
    <meta name="robots" content="noindex" />

You can join flags together with +:

http://crawlbin.com/meta_noindex+meta_nofollow+response_410/
    HTTP Status: 410 (Gone)
    <meta name="robots" content="nofollow noindex" />

You can create redirects from one block of flags to another with a /:

http://crawlbin.com/canonical_self+vary_user_agent/response_301/
    HTTP Status: 301 (Moved Permanently)
    Location: http://crawlbin.com/canonical_self+vary_user_agent/

which then serves:

http://crawlbin.com/canonical_self+vary_user_agent/
    Vary: User-Agent
    <link rel="canonical" href="http://crawlbin.com/canonical_self+vary_user_agent/" />
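The flag-joining syntax above can be sketched as a small URL builder. This is purely illustrative: the helper name crawlbin_url is hypothetical and not part of any crawlbin client library; only the URL scheme itself comes from the examples above.

```python
# Sketch of a builder for crawlbin's flag URL syntax.
# crawlbin_url is a hypothetical helper, not part of crawlbin.

BASE = "http://crawlbin.com"

def crawlbin_url(*blocks):
    """Join each block's flags with '+' and separate blocks with '/'.

    Each argument is a list of flag strings; passing several arguments
    produces the multi-block (redirect chain) form described above.
    """
    path = "/".join("+".join(block) for block in blocks)
    return f"{BASE}/{path}/"

assert crawlbin_url(["meta_noindex", "meta_nofollow", "response_410"]) == \
    "http://crawlbin.com/meta_noindex+meta_nofollow+response_410/"
```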

You can also create randomised requests and requests that target certain user agents:

http://crawlbin.com/[all:response_404][mobile:response_500][desktop:meta_noindex]/

Mobile devices: HTTP Status: 500 (Internal Server Error) OR HTTP Status: 404 (Not Found)
Desktop devices: <meta name="robots" content="noindex" /> OR HTTP Status: 404 (Not Found)
Other devices: HTTP Status: 404 (Not Found)

Table of Contents

  1. Full List of Flags
  2. Randomised responses
  3. User-agent / device determined responses
  4. FAQ

Full List of Flags

crawlbin accepts a variety of flags. Here is the complete list.

  1. Response Codes & Redirects
  2. no/index directive
  3. no/follow directive
  4. H1 tags
  5. Canonical
  6. Vary HTTP header

Response Codes & Redirects

Multiple HTTP response codes are supported. With most response codes, other flags will still work: for example, you can craft a 404 page that has a canonical HTTP header and two H1 tags. Note, however, that for response codes that redirect to a new page, no body content is returned.

The list of non-redirect response codes supported is:

http://crawlbin.com/response_400/    HTTP Status: 400 (Bad Request)
http://crawlbin.com/response_401/    HTTP Status: 401 (Unauthorized)
http://crawlbin.com/response_403/    HTTP Status: 403 (Forbidden)
http://crawlbin.com/response_404/    HTTP Status: 404 (Not Found)
http://crawlbin.com/response_410/    HTTP Status: 410 (Gone)
http://crawlbin.com/response_418/    HTTP Status: 418 (I'm a teapot)
http://crawlbin.com/response_500/    HTTP Status: 500 (Internal Server Error)
http://crawlbin.com/response_503/    HTTP Status: 503 (Service Unavailable)

Redirects

When a redirect code is returned, the 'Location' header is set to the next level 'down' of the URL, as determined by a delimiting /:

http://crawlbin.com/canonical_self+vary_user_agent/response_301/
    HTTP Status: 301 (Moved Permanently)
    Location: http://crawlbin.com/canonical_self+vary_user_agent/

which then serves:

http://crawlbin.com/canonical_self+vary_user_agent/
    Vary: User-Agent
    <link rel="canonical" href="http://crawlbin.com/canonical_self+vary_user_agent/" />
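The 'next level down' rule can be sketched as a function that strips the last /-delimited flag block from the URL. This is an illustration of the rule as described, not crawlbin's actual implementation; the helper name redirect_location is hypothetical.

```python
from urllib.parse import urlparse, urlunparse

def redirect_location(url):
    """Compute the Location a redirect flag would point at:
    the same URL with its last flag block (path segment) removed.

    Illustrative sketch of the rule described above, not crawlbin's code.
    """
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    # Drop the final block; an empty remainder redirects to the homepage.
    new_path = "/" + "/".join(segments[:-1]) + "/" if len(segments) > 1 else "/"
    return urlunparse(parts._replace(path=new_path))

assert redirect_location(
    "http://crawlbin.com/canonical_self+vary_user_agent/response_301/"
) == "http://crawlbin.com/canonical_self+vary_user_agent/"
```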

The full list of supported redirect response codes:

http://crawlbin.com/response_301/    HTTP Status: 301 (Moved Permanently)
http://crawlbin.com/response_302/    HTTP Status: 302 (Found)
http://crawlbin.com/response_303/    HTTP Status: 303 (See Other)
http://crawlbin.com/response_307/    HTTP Status: 307 (Temporary Redirect)
http://crawlbin.com/response_308/    HTTP Status: 308 (Permanent Redirect)

no/index directive

The 'noindex' directive is supported in both meta tag and HTTP header forms:

http://crawlbin.com/meta_noindex/
    <meta name="robots" content="noindex" />

http://crawlbin.com/header_noindex/
    X-Robots-Tag: noindex

You can specify both 'index' and 'noindex' as values:

http://crawlbin.com/header_index/
    X-Robots-Tag: index

Just like other tags, they can be used together:

http://crawlbin.com/header_noindex+meta_noindex/
    X-Robots-Tag: noindex
    <meta name="robots" content="noindex" />

Conflicting values are allowed (either across the two forms, or within the same form):

http://crawlbin.com/header_noindex+header_index+meta_index+meta_noindex/
    X-Robots-Tag: noindex, index
    <meta name="robots" content="noindex, index" />
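A crawler being tested against these pages needs to detect such conflicts. A minimal sketch of that check (the helper names are hypothetical, not part of crawlbin or any crawler library):

```python
def robots_values(header_value):
    """Split an X-Robots-Tag value like 'noindex, index' into tokens."""
    return [v.strip().lower() for v in header_value.split(",") if v.strip()]

def is_conflicting(values):
    """True when a directive and its negation both appear,
    e.g. both 'index' and 'noindex'."""
    vals = set(values)
    return any(v in vals and "no" + v in vals for v in ("index", "follow"))

assert is_conflicting(robots_values("noindex, index"))
assert not is_conflicting(robots_values("noindex, nofollow"))
```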

no/follow directive

The 'nofollow' directive is supported in both meta tag and HTTP header forms, and works similarly to the 'noindex' directive above:

http://crawlbin.com/meta_nofollow/
    <meta name="robots" content="nofollow" />

http://crawlbin.com/header_follow/
    X-Robots-Tag: follow

H1 tags

By default crawlbin serves a page with a single H1 header tag in the HTML source. However, there are options to have a page with zero or two H1 tags:

http://crawlbin.com/h1_off/
    (no H1 tag in the page)

http://crawlbin.com/h1_multiple/
    <h1>crawlbin</h1>
    <h1>Another title</h1>

Canonical

The 'canonical' directive is supported in both HTML link tag and HTTP header forms. You can specify various values:

http://crawlbin.com/canonical_self/
    Link: <http://crawlbin.com/canonical_self/>; rel="canonical"
    <link rel="canonical" href="http://crawlbin.com/canonical_self/" />

http://crawlbin.com/canonical_home/
    Link: <http://crawlbin.com/>; rel="canonical"
    <link rel="canonical" href="http://crawlbin.com/" />

http://crawlbin.com/h1_multiple+meta_noindex/canonical_next_block/
    Link: <http://crawlbin.com/h1_multiple+meta_noindex/>; rel="canonical"
    <link rel="canonical" href="http://crawlbin.com/h1_multiple+meta_noindex/" />

You can also specify whether you want just an HTTP header or just an HTML directive:

http://crawlbin.com/header_canonical_home/
    Link: <http://crawlbin.com/>; rel="canonical"

http://crawlbin.com/html_canonical_home/
    <link rel="canonical" href="http://crawlbin.com/" />

And it is permitted to combine these with conflicting values:

http://crawlbin.com/h1_multiple+meta_noindex/html_canonical_next_block+header_canonical_self/
    Link: <http://crawlbin.com/h1_multiple+meta_noindex/html_canonical_next_block+header_canonical_self/>; rel="canonical"
    <link rel="canonical" href="http://crawlbin.com/h1_multiple+meta_noindex/" />
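The two canonical forms can be sketched as simple formatters; the helper names are hypothetical, and the Link header and link tag syntax follow the examples above:

```python
def canonical_header(url):
    """HTTP header form of the canonical directive."""
    return f'Link: <{url}>; rel="canonical"'

def canonical_link_tag(url):
    """HTML link tag form of the canonical directive."""
    return f'<link rel="canonical" href="{url}" />'

assert canonical_header("http://crawlbin.com/") == \
    'Link: <http://crawlbin.com/>; rel="canonical"'
assert canonical_link_tag("http://crawlbin.com/") == \
    '<link rel="canonical" href="http://crawlbin.com/" />'
```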

Vary HTTP header

There are various allowed values for the Vary HTTP header:

http://crawlbin.com/vary_accept_encoding/    Vary: Accept-Encoding
http://crawlbin.com/vary_user_agent/         Vary: User-Agent
http://crawlbin.com/vary_cookie/             Vary: Cookie
http://crawlbin.com/vary_referer/            Vary: Referer

You may join multiple values:

http://crawlbin.com/vary_user_agent+vary_cookie/
    Vary: User-Agent, Cookie

Randomised responses

It is possible to supply multiple alternative blocks of flags, one of which is selected at random, so that responses are less predictable. Each block must be surrounded by square brackets: [].

http://crawlbin.com/[meta_noindex+vary_cookie][response_404]/
    Vary: Cookie <meta name="robots" content="noindex" />
    OR HTTP Status: 404 (Not Found)

You can supply as many of these alternative blocks as you wish within any single /-delimited section of the URL.

Furthermore, a single level of nesting is possible with randomisation. In this case, nested sets of [] are used, and the nested flags must be comma-separated.

http://crawlbin.com/[meta_noindex+[vary_cookie,vary_referer]][response_404]/
    Vary: Cookie <meta name="robots" content="noindex" />
    OR Vary: Referer <meta name="robots" content="noindex" />
    OR HTTP Status: 404 (Not Found)
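The selection semantics can be modelled with a tiny sketch: pick one alternative uniformly at random, resolving any nested alternatives first. This is an illustration of the behaviour described above, not crawlbin's actual implementation; choose_block is a hypothetical name.

```python
import random

def choose_block(alternatives, rng=random):
    """Pick one alternative block of flags uniformly at random,
    mirroring the [block1][block2] selection described above."""
    return rng.choice(alternatives)

# Nested comma-separated alternatives resolve first, so
# [meta_noindex+[vary_cookie,vary_referer]][response_404] becomes:
inner = choose_block(["vary_cookie", "vary_referer"])
outer = choose_block([f"meta_noindex+{inner}", "response_404"])
assert outer in ("meta_noindex+vary_cookie",
                 "meta_noindex+vary_referer",
                 "response_404")
```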

This nesting allows for more complex sets of randomisation, but can also be combined with the user-agent targeting below.

User-agent determined responses

It is possible to target different blocks of flags at different devices or crawlers based upon their user agent. To do this, enclose a block of flags in square brackets [], as above, and prefix the block with a user agent flag and a :.

http://crawlbin.com/[mobile:response_500][meta_noindex]/

Mobile devices: HTTP Status: 500 (Internal Server Error)
Other devices: <meta name="robots" content="noindex" />

The user agent flags accepted, and what each matches, are:

  • all - All user agents.
  • bot - Any bot.
  • googlebot - Googlebot but not other bots.
  • desktop - Any desktop browser.
  • mobile - Any mobile browser.
  • tablet - Any tablet browser.
  • ie - Any version of Internet Explorer.
  • ff - Any version of Firefox.

When a request is made, crawlbin selects all blocks whose user agent flag matches the requesting user agent, and then chooses amongst them at random in the same fashion as outlined above. A single request may match multiple user agent flags (for example, the desktop and ie flags may both match an Internet Explorer request).

http://crawlbin.com/[all:response_404][mobile:response_500][desktop:meta_noindex]/

Mobile devices: HTTP Status: 500 (Internal Server Error) OR HTTP Status: 404 (Not Found)
Desktop devices: <meta name="robots" content="noindex" /> OR HTTP Status: 404 (Not Found)
Other devices: HTTP Status: 404 (Not Found)
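The match-then-randomise rule can be sketched as follows. This is a model of the behaviour described above, not crawlbin's own code; the function name and the shape of the inputs are assumptions.

```python
import random

def select_for_user_agent(blocks, ua_flags, rng=random):
    """Collect every block whose user agent flag matches the request,
    then pick one of those blocks at random, as described above.

    blocks:   list of (ua_flag, flags) pairs, in URL order
    ua_flags: set of user agent flags the request matches,
              e.g. {"all", "desktop", "ie"} for an IE request
    """
    matching = [flags for ua, flags in blocks if ua in ua_flags]
    return rng.choice(matching) if matching else None

# The example above: [all:response_404][mobile:response_500][desktop:meta_noindex]
blocks = [("all", "response_404"),
          ("mobile", "response_500"),
          ("desktop", "meta_noindex")]

# A mobile request matches both 'all' and 'mobile', so it may
# receive either response_404 or response_500:
assert select_for_user_agent(blocks, {"all", "mobile"}) in (
    "response_404", "response_500")
```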

FAQ

Do you handle POST requests?

Not currently.

Do you handle HTTPS requests?

Not currently.