AMP Crawler Advanced Configurations

A number of additional settings can be configured for AMP Web Tests. These settings let you test content while reducing repetition in testing.

Viewport Size - Allows the user to change the spider's viewport size to emulate a browser size. Users can select from various viewport sizes to match the browser they would like to emulate.

User Agent - The User Agent feature passes the user agent string for the browser on the selected device through Firefox and then resizes the browser window to match the actual device size. After that, AMP tests the rendered DOM in the browser. The default browser is 'Chrome'; the chosen browser is used to render the page on the server and then perform the automated testing.

Maximum Page Count - The total number of pages that the spider should attempt to collect.

Maximum Depth - Defines how far the spider should branch out as it runs. The extreme minimum, a depth of zero, would spider only the start location. In general, depth should be set to at least one, which means that AMP would diagnose the base page and any page that is linked directly from the base page. A depth of two means the spider should diagnose the base page, pages linked from the base page, and pages linked from those pages (conceptually, "grandchildren" or third-generation pages).
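
The way Maximum Page Count and Maximum Depth interact can be sketched in a few lines of Python. This is only an illustration: the LINKS table is a hypothetical stand-in for a real site, crawl() is not an AMP function, and the breadth-first order is an assumption, since this article does not say in which order AMP visits pages.

```python
from collections import deque

# Hypothetical link graph standing in for a real site: page -> pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/"],
    "/b": ["/b1"],
    "/a1": ["/a1x"],
    "/b1": [],
    "/a1x": [],
}


def crawl(start, max_depth, max_pages):
    """Breadth-first walk limited to max_depth links from start and max_pages pages."""
    seen = {start}
    queue = deque([(start, 0)])
    collected = []
    while queue and len(collected) < max_pages:
        page, depth = queue.popleft()
        collected.append(page)
        if depth == max_depth:
            continue  # pages at the depth limit are collected, but their links are not followed
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return collected


depth_zero = crawl("/", max_depth=0, max_pages=100)
depth_one = crawl("/", max_depth=1, max_pages=100)
depth_two = crawl("/", max_depth=2, max_pages=100)
capped = crawl("/", max_depth=2, max_pages=4)
```

With this graph, a depth of zero collects only the start page, a depth of one adds /a and /b, and a depth of two adds the "grandchildren" /a1 and /b1 but never /a1x. The page limit simply stops the walk early.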

Maximum Argument Count - Defines the number of unique argument pages that should be captured for a given base URL. As an example, take the URL https://www.acme.com/prod.php&product_id=123. This URL has one argument, which is product_id=123. If the Maximum Argument Count were set to 5, AMP would test only the first 5 pages that it finds that have the argument 'product_id'. After 5, it would not spider additional pages with that argument.

Maximum Argument Count = 5

www.acme.com/prod.php&product_id=123 (would be spidered)

www.acme.com/prod.php&product_id=124 (would be spidered)

www.acme.com/prod.php&product_id=156 (would be spidered)

www.acme.com/prod.php&product_id=158 (would be spidered)

www.acme.com/prod.php&product_id=155 (would be spidered)

www.acme.com/prod.php&product_id=876 (would NOT be spidered, as it is the 6th argument of the same kind and the maximum argument count is set to 5)

www.acme.com/prod.php&product_id=743&type=onsale (would be spidered based on the addition of a new argument in the URL)
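
The behavior in the list above can be approximated with a short Python sketch. It rests on two assumptions that this article does not confirm: URLs are grouped by host, path, and the set of argument names, and the example URLs are written with the conventional "?" before the first argument so that the standard library can parse them. The function name filter_by_argument_count is hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def filter_by_argument_count(urls, max_argument_count):
    """Keep at most max_argument_count URLs per (host, path, argument names) group."""
    counts = {}
    kept = []
    for url in urls:
        parts = urlsplit(url)
        # Only the argument names matter for grouping, not their values.
        names = tuple(sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}))
        key = (parts.netloc, parts.path, names)
        counts[key] = counts.get(key, 0) + 1
        if counts[key] <= max_argument_count:
            kept.append(url)
    return kept


# Same pages as the list above, written with "?" (an assumption for parsing).
found = [
    "https://www.acme.com/prod.php?product_id=123",
    "https://www.acme.com/prod.php?product_id=124",
    "https://www.acme.com/prod.php?product_id=156",
    "https://www.acme.com/prod.php?product_id=158",
    "https://www.acme.com/prod.php?product_id=155",
    "https://www.acme.com/prod.php?product_id=876",
    "https://www.acme.com/prod.php?product_id=743&type=onsale",
]
spidered = filter_by_argument_count(found, max_argument_count=5)
```

Here the sixth product_id page is dropped, while the last URL is kept because the extra 'type' argument puts it in a different group.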

Positive Filters - Defines a path that the spider should follow using a regular expression (AMP follows standard PHP syntax). These can be used to specify particular pages or hosts that you would like the spider to examine - essentially extending the Page Restriction setup. Positive Filters are | (pipe) delimited. As an example, a few regular expressions have been provided below:

    • To implement the 'Restrict to Path' requirement the positive filter is set to .*www\.levelaccess\.com/contact.php.*
    • To implement the 'Restrict to Host' requirement the positive filter is set to .*www\.levelaccess\.com.*
    • To implement the 'Restrict to Domain' requirement the positive filter is set to .*levelaccess\.com.*

Negative Filters - Defines a regular expression that the spider should not follow. These can be used to specify particular pages that you do not want the spider to visit. A good example of this would be .*cgi-bin.* which would ignore all URLs that contained "cgi-bin". Negative Filters are | (pipe) delimited.
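
As a rough illustration of how positive and negative filters could combine, the sketch below uses Python's re module rather than PHP; for simple patterns like the ones in this article the two behave the same. Several things here are assumptions: the function name should_follow, the choice to let a negative match win over a positive match, and treating each pipe-delimited field as a single regular expression (a pipe is also the regex alternation operator, so the result is equivalent).

```python
import re


def should_follow(url, positive_filters="", negative_filters=""):
    """Return True if url passes the positive filters and matches no negative filter."""
    if negative_filters and re.fullmatch(negative_filters, url):
        return False  # assumption: a negative match always wins
    if positive_filters and not re.fullmatch(positive_filters, url):
        return False  # positive filters are set, and none of them matched
    return True


# Two pipe-delimited positive filters and one negative filter.
POSITIVE = r".*www\.levelaccess\.com/contact.php.*|.*amp\.levelaccess\.com.*"
NEGATIVE = r".*cgi-bin.*"

contact_ok = should_follow("http://www.levelaccess.com/contact.php?x=1", POSITIVE, NEGATIVE)
about_ok = should_follow("http://www.levelaccess.com/about.php", POSITIVE, NEGATIVE)
cgi_ok = should_follow("http://amp.levelaccess.com/cgi-bin/run", POSITIVE, NEGATIVE)
amp_ok = should_follow("http://amp.levelaccess.com/login", POSITIVE, NEGATIVE)
```

In this sketch the contact page and the amp login page are followed, the about page matches no positive filter, and the cgi-bin URL is rejected by the negative filter even though its host matches a positive one.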

XPath Exclusion - This field allows you to specify a common 'section' of the site, such as a specific div element, to exclude from automatic testing during the spidering process. XPath Exclusions should be formatted as follows: /html/body/div[@id="RandomAds"]. To add multiple XPath Exclusions, separate the XPaths with commas (e.g. /html/body, /html/body/div[@id="RandomAds"], /html/head).

Note that the XPath Exclusion field supports full XPath syntax, including, for example, the not() XPath function, as in:

//*[not(ancestor-or-self::main)]

or

//*[not(ancestor-or-self::footer)],//*[not(ancestor-or-self::header)]

which would exclude, in the first example, all elements that are not the <main> element or one of its descendants and, in the second example, all elements that are not part of the <header> or <footer> elements or their descendants. You can use this flexibility to narrow in on the specific portion of the page you'd like to test in your spiders.
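
To see which elements an expression such as //*[not(ancestor-or-self::main)] selects, the Python sketch below walks a small hypothetical page by hand. Python's standard xml.etree.ElementTree module does not support the ancestor-or-self axis, so the walk reproduces the selection logic instead of evaluating the XPath itself. The page markup and the function name select_not_within are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page used only for this illustration.
PAGE = (
    "<html><head><title>t</title></head>"
    "<body><header><nav/></header>"
    "<main><p>text</p></main>"
    "<footer><span/></footer></body></html>"
)


def select_not_within(root, tag):
    """Return the tags of elements that //*[not(ancestor-or-self::TAG)] would select."""
    selected = []

    def walk(element, inside):
        # An element is "inside" once it is TAG itself or has a TAG ancestor.
        inside = inside or element.tag == tag
        if not inside:
            selected.append(element.tag)
        for child in element:
            walk(child, inside)

    walk(root, False)
    return selected


outside_main = select_not_within(ET.fromstring(PAGE), "main")
```

Note that the selection includes the ancestors of <main> (here <html> and <body>) as well as the header and footer content. How AMP treats a matched ancestor whose descendants are not matched is not covered in this article.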

Publish Document Inventory - Selecting this option will cause AMP to catalog any documents (e.g. .pdf, .doc, .txt, .xls) it spiders and to create a list of those documents and their locations. It will also pick up images if there are links to images at a URL.

Scope - Defines the basic restriction on the type of pages that should be spidered.

  • Path Restriction will ensure that only pages that are present at or below the path of the Start Location will be spidered. For example, if the Start Location is http://www.levelaccess.com/contact.php, only pages in the contact directory or its sub-directories on the www.levelaccess.com server will be spidered.
  • Restrict to Host will ensure that only pages that are present on the Start Location's host will be spidered. For example, if the Start Location is http://www.levelaccess.com/contact.php, any pages on www.levelaccess.com will be spidered.
  • Restrict to Domain will ensure that only pages that are present on the Start Location's domain will be spidered. For example, if the Start Location is http://www.levelaccess.com/contact.php, pages on www.levelaccess.com, amp.levelaccess.com, or any other server in the levelaccess.com domain will be spidered.
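
The three scopes map onto positive filters of the kind shown in the Positive Filters section. The Python sketch below derives such a filter from a Start Location. The function name scope_to_positive_filter and the scope keywords are hypothetical, and taking the last two labels of the host name as the domain is a simplification that would not hold for names such as example.co.uk.

```python
import re
from urllib.parse import urlsplit


def scope_to_positive_filter(start_location, scope):
    """Build a '.*target.*' regular expression for the given scope."""
    parts = urlsplit(start_location)
    host = parts.hostname or ""
    if scope == "path":
        target = host + parts.path
    elif scope == "host":
        target = host
    elif scope == "domain":
        # Naive assumption: the domain is the last two labels of the host name.
        target = ".".join(host.split(".")[-2:])
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return ".*" + re.escape(target) + ".*"


START = "http://www.levelaccess.com/contact.php"
path_filter = scope_to_positive_filter(START, "path")
host_filter = scope_to_positive_filter(START, "host")
domain_filter = scope_to_positive_filter(START, "domain")

in_domain = re.fullmatch(domain_filter, "http://amp.levelaccess.com/") is not None
in_host = re.fullmatch(host_filter, "http://amp.levelaccess.com/") is not None
```

With this Start Location, amp.levelaccess.com passes the domain filter but not the host filter, which matches the difference between the two scopes described above.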

Ignore IFrames - IFrames will be ignored during testing, and no violations will be reported for them.

Apply Scope to IFrames - If the src attribute of an iframe is present then we resolve the location to a full URL and pass it through the positive filters (which get generated automatically from the scope). If the iframe source location is not present or passes the filter then it will be tested. Otherwise, it will be excluded from testing.
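
The decision described for Apply Scope to IFrames can be sketched as follows. The function name should_test_iframe is hypothetical, the filter is passed in as a single regular expression, and Python's urljoin stands in for whatever URL resolution AMP performs internally.

```python
import re
from urllib.parse import urljoin


def should_test_iframe(page_url, iframe_src, positive_filter):
    """Decide whether an iframe is tested, given the page URL and the iframe's src."""
    if not iframe_src:
        return True  # no source location present: the iframe is tested
    full_url = urljoin(page_url, iframe_src)  # resolve a relative src to a full URL
    return re.fullmatch(positive_filter, full_url) is not None


HOST_FILTER = r".*www\.levelaccess\.com.*"  # filter for a 'Restrict to Host' scope
PAGE_URL = "http://www.levelaccess.com/contact.php"

no_src = should_test_iframe(PAGE_URL, None, HOST_FILTER)
same_host = should_test_iframe(PAGE_URL, "widgets/map.html", HOST_FILTER)
other_host = should_test_iframe(PAGE_URL, "https://ads.example.com/banner", HOST_FILTER)
```

The relative src resolves to a URL on www.levelaccess.com and passes the filter, while the iframe from ads.example.com would be excluded from testing.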

Ignore URL Fragment - The URL fragment will be ignored during link harvesting to reduce the number of duplicate URLs being tested.

Ignore Query String - URL query strings will be ignored during link harvesting to reduce the number of duplicate URLs being tested.
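
The effect of these two options on duplicate detection can be illustrated with Python's standard URL tools. The function names normalize and dedupe and the sample links are hypothetical. The sketch shows only the general idea of comparing URLs after removing the ignored parts, not AMP's actual link harvesting.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url, ignore_fragment=True, ignore_query=False):
    """Return url with the fragment and/or query string removed."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if ignore_fragment:
        fragment = ""
    if ignore_query:
        query = ""
    return urlunsplit((scheme, netloc, path, query, fragment))


def dedupe(urls, **options):
    """Keep the first URL seen for each normalized form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url, **options)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


harvested = [
    "http://www.levelaccess.com/contact.php?tab=1#top",
    "http://www.levelaccess.com/contact.php?tab=1#form",
    "http://www.levelaccess.com/contact.php?tab=2",
]
without_fragments = dedupe(harvested, ignore_fragment=True)
without_queries = dedupe(harvested, ignore_fragment=True, ignore_query=True)
```

Ignoring fragments collapses the first two links into one page, and ignoring query strings as well collapses all three.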
