In a recent LinkedIn post, Google Analyst Gary Illyes reinforced long-standing advice for website owners: use the robots.txt file to block web crawlers from accessing URLs that trigger actions such as adding items to carts or wishlists.
Illyes highlighted a common complaint of excessive crawler traffic overwhelming servers, often because search engine bots crawl URLs intended only for user actions.
He stated:
“Upon examining the content we crawl from the sites in question, it’s all too common to find action URLs like ‘add to cart’ and ‘add to wishlist.’ These hold no value for crawlers and are likely not intended for indexing.”
To mitigate this unnecessary server strain, Illyes recommended adding rules to the robots.txt file that disallow URLs containing parameters like “?add_to_cart” or “?add_to_wishlist.”
Illustrating with an example, he proposed:
“If your URLs resemble:
https://example.com/product/scented-candle-v1?add_to_cart
and
https://example.com/product/scented-candle-v1?add_to_wishlist
It’s advisable to include a disallow rule for them in your robots.txt file.”
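As a rough illustration, a minimal robots.txt sketch based on Illyes’ example URLs might look like the following (the “add_to_cart” and “add_to_wishlist” parameter names come from his hypothetical product page; substitute whatever parameters your own site actually uses):

User-agent: *
Disallow: /*?add_to_cart
Disallow: /*?add_to_wishlist

The * wildcard in these patterns is honored by Googlebot and other major search crawlers, though it was not part of the original 1994 specification; if the parameters can also appear after other query parameters, the patterns would need to be broadened accordingly.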
While using the HTTP POST method can also deter the crawling of such URLs, Illyes cautioned that crawlers can still make POST requests, underscoring the continued relevance of robots.txt.
Alan Perkins, contributing to the discussion, highlighted that this guidance reflects web standards established in the 1990s for analogous reasons.
Quoting from a 1994 document titled “A Standard for Robot Exclusion”:
“In 1993 and 1994, there were instances where robots visited WWW servers where they weren’t welcome for various reasons… robots traversed sections of WWW servers that weren’t suitable, such as very deep virtual trees, duplicated information, temporary data, or cgi-scripts with side-effects (like voting).”
The robots.txt standard, which defines rules that well-behaved crawlers are expected to follow, emerged as a “consensus” solution among web stakeholders in 1994.
Illyes confirmed that Google’s crawlers consistently adhere to robots.txt rules, with rare exceptions meticulously documented for cases involving “user-triggered or contractual fetches.”
This commitment to the robots.txt protocol has long been a cornerstone of Google’s web crawling policies.
Although the guidance may appear basic, the resurgence of this decades-old best practice emphasizes its continued relevance.
By applying the robots.txt standard, websites can rein in overly enthusiastic crawlers and keep them from monopolizing bandwidth with futile requests.
Whether you manage a modest blog or a bustling e-commerce hub, following Google’s recommendation to use robots.txt to restrict crawler access to action URLs can spare your servers unnecessary load and wasted crawl requests.
Reassessing robots.txt directives could serve as a straightforward yet impactful measure for websites seeking to exert greater control over crawler behavior.
Illyes’ messaging underscores the enduring relevance of these age-old robots.txt rules in our contemporary web landscape.
Original news from SearchEngineJournal