AbcLinuxu portal, 6 May 2024 12:22

Google doesn't like wget

18.8.2009 16:13 | Read: 1719× | From the world

Google cache cannot be fetched with wget or curl. Changing the user agent is enough to make the problem go away, though. Any arbitrary user agent string will do, as long as it does not contain wget or curl.

The comrades at Google have apparently decided to protect their cache database against bulk downloading. So they filter access to it according to whether the User-Agent field in the HTTP header contains the string wget or curl. If it does not, they send the content. If it does, they return "403 Forbidden".
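The behaviour described above can be modelled in a few lines of shell. This is only a sketch of what the filter appears to do from the outside, not Google's actual code. The function name status_for_user_agent is made up for illustration, and the case-insensitive match is an assumption based on the fact that wget's default User-Agent starts with "Wget/" and is rejected too. The string "Wget/1.11.4" is just an example of that default.

```shell
#!/bin/sh
# Hypothetical model of the observed filter; not Google's real implementation.
# Assumption: case-insensitive substring match on the User-Agent value.

status_for_user_agent() {
    # Lower-case the value so "Wget/1.11.4" and "wget" are treated alike.
    lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$lowered" in
        *wget*|*curl*) echo "403 Forbidden" ;;  # blacklisted substring found
        *)             echo "200 OK" ;;         # anything else is served
    esac
}

# Replay the four user agents tried in the transcript further down.
for agent in "wge" "Wget/1.11.4" "curl" "cur"; do
    printf '%-12s -> %s\n' "$agent" "$(status_for_user_agent "$agent")"
done
```

Under these assumptions the model gives the same verdicts as the transcript: "wge" and "cur" pass, while the default wget agent and "curl" are refused.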

The cache can be looked up by typing "cache:<url>" into the Google search box, e.g. cache:http://www.abclinuxu.cz. This redirects to some server in the Google cluster, which sends back the page as Googlebot saw it on its last visit.
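For scripting, the same query can be written as a URL. The helper below is a hypothetical sketch: the name cache_query_url and the default host www.google.com are assumptions, while the /search?q=cache:<url> shape and the IP address in the second call are taken from the requests in the transcript in this post. The page URL is passed through unencoded, as in that transcript.

```shell
#!/bin/sh
# Hypothetical helper: builds a "cache:" search URL shaped like the ones
# requested in the transcript. It only prints the URL, it fetches nothing.

cache_query_url() {
    page=$1
    host=${2:-www.google.com}   # assumed default; any Google front end may be given
    printf 'http://%s/search?q=cache:%s\n' "$host" "$page"
}

cache_query_url 'http://www.abclinuxu.cz/'
cache_query_url 'http://www.abclinuxu.cz/' 209.85.129.132
```

The resulting string can then be handed to wget together with a --user-agent value that avoids the blocked substrings.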

I could perhaps still understand a whitelist of browsers (even though it would be just as useless), but a blacklist I do not understand at all. 90% of the people who already know how to run wget also know how to change the user agent.
$ wget --user-agent wge 'http://209.85.129.132/search?client=opera&rls=en&hl=en&q=cache:http://www.abclinuxu.cz/&sourceid=opera&num=25&ie=utf-8&oe=utf-8'
--2009-08-18 16:00:46--  http://209.85.129.132/search?client=opera&rls=en&hl=en&q=cache:http://www.abclinuxu.cz/&sourceid=opera&num=25&ie=utf-8&oe=utf-8
Connecting to 209.85.129.132:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `search?client=opera&rls=en&hl=en&q=cache:http:%2F%2Fwww.abclinuxu.cz%2F&sourceid=opera&num=25&ie=utf-8&oe=utf-8.1'

    [ <=>                                                                                              ] 91,817       526K/s   in 0.2s

2009-08-18 16:00:47 (526 KB/s) - `search?client=opera&rls=en&hl=en&q=cache:http:%2F%2Fwww.abclinuxu.cz%2F&sourceid=opera&num=25&ie=utf-8&oe=utf-8.1' saved [91817]

$ wget 'http://209.85.129.132/search?client=opera&rls=en&hl=en&q=cache:http://www.abclinuxu.cz/&sourceid=opera&num=25&ie=utf-8&oe=utf-8'
--2009-08-18 16:00:50--  http://209.85.129.132/search?client=opera&rls=en&hl=en&q=cache:http://www.abclinuxu.cz/&sourceid=opera&num=25&ie=utf-8&oe=utf-8
Connecting to 209.85.129.132:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2009-08-18 16:00:51 ERROR 403: Forbidden.

$ wget --user-agent curl 'http://209.85.129.132/search?client=opera&rls=en&hl=en&q=cache:http://www.abclinuxu.cz/&sourceid=opera&num=25&ie=utf-8&oe=utf-8'
--2009-08-18 16:03:44--  http://209.85.129.132/search?client=opera&rls=en&hl=en&q=cache:http://www.abclinuxu.cz/&sourceid=opera&num=25&ie=utf-8&oe=utf-8
Connecting to 209.85.129.132:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2009-08-18 16:03:45 ERROR 403: Forbidden.

$ wget --user-agent cur 'http://209.85.129.132/search?client=opera&rls=en&hl=en&q=cache:http://www.abclinuxu.cz/&sourceid=opera&num=25&ie=utf-8&oe=utf-8'
--2009-08-18 16:03:49--  http://209.85.129.132/search?client=opera&rls=en&hl=en&q=cache:http://www.abclinuxu.cz/&sourceid=opera&num=25&ie=utf-8&oe=utf-8
Connecting to 209.85.129.132:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `search?client=opera&rls=en&hl=en&q=cache:http:%2F%2Fwww.abclinuxu.cz%2F&sourceid=opera&num=25&ie=utf-8&oe=utf-8.2'

    [ <=>                                                                                              ] 91,817       523K/s   in 0.2s

2009-08-18 16:03:49 (523 KB/s) - `search?client=opera&rls=en&hl=en&q=cache:http:%2F%2Fwww.abclinuxu.cz%2F&sourceid=opera&num=25&ie=utf-8&oe=utf-8.2' saved [91817]
       


Comments


18.8.2009 17:27 Limoto | score: 32 | blog: Limotův blog
Re: Google doesn't like wget

Blik! By the way, it is not just Google cache, it is pretty much everything from Google (and it is not just wget; it does not accept urllib either, for example).

18.8.2009 23:13 Tomas
Re: Google doesn't like wget
The question is why...
18.8.2009 23:44 Semo | score: 45 | blog: Semo
Re: Google doesn't like wget
Ha, true, almost everything. And I don't know why I didn't notice. Probably because the main page works fine.
If you hold a Unix shell up to your ear, you can hear the C.
18.8.2009 18:42 hikikomori82 | score: 18 | blog: foobar | Košice
Re: Google doesn't like wget
Good morning, grandma. Does anyone here even read my blogs?
A free font for technical drawing


ISSN 1214-1267, (c) 1999-2007 Stickfish s.r.o.