neiman 5 years ago

In 2020, the only way for netizens to get what they naturally deserve is by hacking.

  • bzb6 5 years ago

    How do you “naturally deserve” to access the contents of the Twitter website?

    • morsch 5 years ago

      I agree, that's just cruel. No one deserves being subjected to Twitter.

    • devmor 5 years ago

      Since it became a dissemination service for public officials. The moment it became illegal for the US President to block people on Twitter, it should have become illegal for Twitter to restrict the public's access to that information, for the same reason.

  • cooper12 5 years ago

    Setting your user agent would only be considered hacking by the same people who think the Internet is a series of pipes. The browsers themselves copy each other's user agents for interoperability, so it's far past the point that changing it to look like another agent would be considered devious.

    • nextaccountic 5 years ago

      Yeah, but from the POV of whoever runs the network, circumventing such blocks is "abuse"

  • 1vuio0pswjnm7 5 years ago

    The original web browser, NCSA Mosaic, encouraged users to change their User-Agent string, so-called "spoofing" or "masquerading".

    https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...

    The User-Agent header is not mandatory and was never intended to be used by tech companies for denying access or fingerprinting. It was supposed to be used, at the user's discretion, to help with interoperability problems. RFC7231 specifically refers to user-agent masquerading by the user as a useful practice. It explicitly discourages using this header as a means of supposed user identification, e.g., fingerprinting.

    https://tools.ietf.org/html/rfc7231#section-5.5.3
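
    Any HTTP client lets the user pick whatever value they like, or omit the header entirely. A rough sketch in Python (the URL and the UA value here are only placeholders):

      import urllib.request

      # The User-Agent header is optional; the user decides what, if anything, to send.
      req = urllib.request.Request(
          "https://example.com/",
          headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"},
      )
      with urllib.request.urlopen(req) as resp:
          print(resp.status)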

antpls 5 years ago

Given that this trick is now spreading across several sites, it won't last long. Google could, for example, generate secret unique user agents for the biggest players. The biggest players would then only allow requests from that secret unique UA.

  • txdv 5 years ago

    so much for google embracing the open web

  • smarx007 5 years ago

    I think Google shares IP range blocks so you could implement a check like "if(isGooglebot(user_agent) && isGooglebotIp(ip_addr))" in your system.

    Edit: ah no, they stopped https://developers.google.com/search/docs/advanced/crawling/.... I don't think 2 DNS lookups are an acceptable cost for blocking a GET request, but it can be done out of band, i.e. the isGooglebotIp function can fire off a Redis query and, if nothing is found, put the IP into a DNS verification queue. A few requests later, the fake bot gets banned thanks to the new record in Redis.
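
    Roughly, the out-of-band version could look like this (a sketch only: the helper names mirror the pseudocode above rather than any real library, it assumes a local Redis instance, and it uses Google's documented reverse-then-forward DNS check):

      import socket
      import redis

      r = redis.Redis()

      def is_googlebot(user_agent):
          # Cheap first-pass check on the claimed identity.
          return "Googlebot" in user_agent

      def verify_googlebot_ip(ip_addr):
          # The two DNS lookups: reverse-resolve the IP, require a Google hostname,
          # then forward-resolve that hostname and make sure it matches the IP.
          try:
              host, _, _ = socket.gethostbyaddr(ip_addr)
              if not host.endswith((".googlebot.com", ".google.com")):
                  return False
              return socket.gethostbyname(host) == ip_addr
          except OSError:
              return False

      def is_googlebot_ip(ip_addr):
          # Answer from the cache; unknown IPs are queued for verification
          # instead of paying for two DNS lookups on the request path.
          cached = r.get("googlebot:" + ip_addr)
          if cached is not None:
              return cached == b"1"
          r.rpush("dns-verify-queue", ip_addr)
          return True  # benefit of the doubt for the first few requests

      def verifier_worker():
          # Runs out of band: drains the queue and records verdicts in Redis,
          # so a faker gets banned a few requests later.
          while True:
              _, ip_addr = r.blpop("dns-verify-queue")
              ip_addr = ip_addr.decode()
              verdict = b"1" if verify_googlebot_ip(ip_addr) else b"0"
              r.set("googlebot:" + ip_addr, verdict)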

    • 1vuio0pswjnm7 5 years ago

      No need to use a Googlebot UA string. Others will work. Such as

      Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0

  • smileysteve 5 years ago

    Things that would be inadvisable while most of these companies are actively facing antitrust suits.

  • kobalsky 5 years ago

    it already happens, try to create your own search engine and index Amazon. captchas everywhere, their robots.txt is just for show.

  • Too 5 years ago

    This trick is almost older than the internet, so if someone cared they would have blocked it already. Sending google.com as the Referrer is another variant of it. Before the stackoverflow days this was very useful for getting past the paywall on expertsexchange, for example.

    I was under the impression that serving other content to google would greatly punish your pagerank and even pull you off the search results completely.

1vuio0pswjnm7 5 years ago

When Twitter announced they were going to stop supporting browsers not on their approved list, I figured their attempts to block would involve something more than just checking the value of the user-agent header.

They should just announce that users must use a particular user-agent header value and provide a list of approved values. If no one else compiles a list of acceptable user-agent header values for Twitter, I might have to do it.

Every user should just use the same user-agent header value. That would negate any utility of the user-agent header.

sethaurus 5 years ago

It’s been received wisdom until now that Google penalizes websites which behave differently when scraped by the Googlebot. Is that no longer the case?

  • SXX 5 years ago

    Pinterest has proven, by spamming SERPs for years, that if you're big enough Google will turn a blind eye to it.

    • syshum 5 years ago

      That applies to more than just google

      If you are big enough, there are separate rules for you (or no rules)

  • smarx007 5 years ago

    They will serve the same content to users with JS enabled and to legit Googlebots, while blocking clients w/o JS and other bots. I don't think it violates Google's rules but ofc it is of questionable decency.

67868018 5 years ago

You just need the word "Bot" in your user agent. It's required for fetching Twitter cards for link previews too. This changed earlier this year.
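
For what it's worth, something like this seems to be enough (the UA value is made up, it just has to contain "Bot", and the URL is only an example):

  import urllib.request

  # Any User-Agent containing "Bot" reportedly gets the JS-free HTML version.
  req = urllib.request.Request(
      "https://twitter.com/jack",
      headers={"User-Agent": "MyLinkPreviewBot/1.0"},
  )
  with urllib.request.urlopen(req) as resp:
      print(resp.status, len(resp.read()))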