Tôi xin tập hợp các file robots.txt từ rất nhiều các blog và các webiste nổi tiểng ở các lĩnh vực khác nhau để các bạn tham khảo.
Vài lời nhận xét :
- Chỉ có 2 trong số 30 website và blog mà chúng tôi kiểm tra là không sử dụng file robots.txt
- Ngay cả khi bạn không có một yêu cầu đặc biệt nào cho con bot tìm kiếm thì bạn vẫn nên sử dụng file robots.txt.
- Hầu hết họ đều sử dụng thuộc tính “User-agent:*” để kiểm soát và cho phép các bộ máy tìm kiếm.
- Họ sử dụng “Disallow” nhiều nhất là để chặn RSS Feed.
- Có một số site còn sử dụng cả URL của sitemap trong file robots.txt.
Những người sử dụng file robots.txt một cách rất hạn chế
Problogger.net
User-agent: *
Disallow:
Marketing Pilgrim
User-agent: *
Disallow:
Search Engine Journal
User-agent: *
Disallow:
Matt Cutts
User-agent: *
Allow:
User-agent: *
Disallow: /files/
Pronet Advertising
User-agent: *
Disallow: /mt
Disallow: /*.cgi$
TechCrunch
User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/
Những người sử dụng file robot.txt với rất nhiều quy định
Online Marketing Blog
User-agent: Googlebot
Disallow: */feed/
User-agent: *
Disallow: /Blogger/
Disallow: /wp-admin/
Disallow: /stats/
Disallow: /cgi-bin/
Disallow: /2005x/
Shoemoney
User-Agent: Googlebot
Disallow: /link.php
Disallow: /gallery2
Disallow: /gallery2/
Disallow: /category/
Disallow: /page/
Disallow: /pages/
Disallow: /feed/
Disallow: /feed
Scoreboard Media
User-agent: *
Disallow: /cgi-bin/
User-agent: Googlebot
Disallow: /category/
Disallow: /page/
Disallow: */feed/
Disallow: /2007/
Disallow: /2006/
Disallow: /wp-*
SEOMoz.org
User-agent: *
Disallow: /blogdetail.php?ID=537
Disallow: /blog?page
Disallow: /blog/author/
Disallow: /blog/category/
Disallow: /tracker
Disallow: /ugc?page
Disallow: /ugc/author/
Disallow: /ugc/category/
Wolf-Howl
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /noindex/
Disallow: /privacy-policy/
Disallow: /about/
Disallow: /company-biographies/
Disallow: /press-media-room/
Disallow: /newsletter/
Disallow: /contact-us/
Disallow: /terms-of-service/
Disallow: /terms-of-service/
Disallow: /information/comment-policy/
Disallow: /faq/
Disallow: /contact-form/
Disallow: /advertising/
Disallow: /information/licensing-information/
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2004/
Disallow: /*?*
Disallow: /page/
Disallow: /iframes/
John Chow
sitemap: http://www.johnchow.com/sitemap.xml
User-agent: *
Disallow: /cgi-bin/
Disallow: /go/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /page/
Disallow: /category/
Disallow: /wp-images/
Disallow: /images/
Disallow: /backup/
Disallow: /banners/
Disallow: /archives/
Disallow: /trackback/
Disallow: /feed/
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
User-agent: Mediapartners-Google
Allow: /
User-agent: duggmirror
Disallow: /
Smashing Magazine
Sitemap: http://www.smashingmagazine.com/sitemap.xml
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /styles/
Disallow: /inc/
Disallow: /tag/
Disallow: /cc/
Disallow: /category/
User-agent: MSIECrawler
Disallow: /
User-agent: psbot
Disallow: /
User-agent: Fasterfox
Disallow: /
User-agent: Slurp
Crawl-delay: 200
Gizmodo
User-Agent: Googlebot
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Sitemap: http://gizmodo.com/sitemap.xml
Lifehacker
User-Agent: Googlebot
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Sitemap: http://lifehacker.com/sitemap.xml
Các site Media
Wall Street Journal
User-agent: *
Disallow: /article_email/
Disallow: /article_print/
Disallow: /PA2VJBNA4R/
Sitemap: http://online.wsj.com/sitemap.xml
ZDNet
User-agent: *
Disallow: /Ads/
Disallow: /redir/
# Disallow: /i/ is removed per 190723
Disallow: /av/
Disallow: /css/
Disallow: /error/
Disallow: /clear/
Disallow: /mac-ad
Disallow: /adlog/
# URS per bug 239819, these were expanded
Disallow: /1300-
Disallow: /1301-
Disallow: /1302-
Disallow: /1303-
Disallow: /1304-
Disallow: /1305-
Disallow: /1306-
Disallow: /1307-
Disallow: /1308-
Disallow: /1309-
Disallow: /1310-
Disallow: /1311-
Disallow: /1312-
Disallow: /1313-
Disallow: /1314-
Disallow: /1315-
Disallow: /1316-
Disallow: /1317-
NY Times
# robots.txt, www.nytimes.com 6/29/2006
#
User-agent: *
Disallow: /pages/college/
Disallow: /college/
Disallow: /library/
Disallow: /learning/
Disallow: /aponline/
Disallow: /reuters/
Disallow: /cnet/
Disallow: /partners/
Disallow: /archives/
Disallow: /indexes/
Disallow: /thestreet/
Disallow: /nytimes-partners/
Disallow: /financialtimes/
Allow: /pages/
Allow: /2003/
Allow: /2004/
Allow: /2005/
Allow: /top/
Allow: /ref/
Allow: /services/xml/
User-agent: Mediapartners-Google*
Disallow:
YouTube
# robots.txt file for YouTube
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /profile
Disallow: /results
Disallow: /browse
Disallow: /t/terms
Disallow: /t/privacy
Disallow: /login
Disallow: /watch_ajax
Disallow: /watch_queue_ajax
Còn Google thì sao?
User-agent: *
Allow: /searchhistory/
Disallow: /news?output=xhtml&
Allow: /news?output=xhtml
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /default
Disallow: /m?
Disallow: /m/search?
Disallow: /wml?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/search?
Không có nhận xét nào:
Đăng nhận xét