Colly has only one prerequisite: the Go programming language. You can install Go using the official installation guide: https://golang.org/doc/install

Install Colly by typing the following command in your terminal and hitting Enter:

go get -u github.com/gocolly/colly/...

Make sure you have an up-to-date version before using Colly.
Let's get started with some simple examples.

First, you need to import Colly into your codebase:
import "github.com/gocolly/colly"
Collector is the main entity of Colly. A Collector manages the network communication and is responsible for the execution of the attached callbacks while a collector job is running. To work with Colly, you must initialize a Collector:
c := colly.NewCollector()
You can attach different types of callback functions to a Collector to control a collecting job or retrieve information. Check out the related section in the package documentation.
c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL)
})

c.OnError(func(_ *colly.Response, err error) {
	log.Println("Something went wrong:", err)
})

c.OnResponseHeaders(func(r *colly.Response) {
	fmt.Println("Visited", r.Request.URL)
})

c.OnResponse(func(r *colly.Response) {
	fmt.Println("Visited", r.Request.URL)
})

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	e.Request.Visit(e.Attr("href"))
})

c.OnHTML("tr td:nth-of-type(1)", func(e *colly.HTMLElement) {
	fmt.Println("First column of a table row:", e.Text)
})

c.OnXML("//h1", func(e *colly.XMLElement) {
	fmt.Println(e.Text)
})

c.OnScraped(func(r *colly.Response) {
	fmt.Println("Finished", r.Request.URL)
})
The callbacks are called in the following order:

1. OnRequest - called before a request
2. OnError - called if an error occurred during the request
3. OnResponseHeaders - called after the response headers were received
4. OnResponse - called after a response was received
5. OnHTML - called right after OnResponse if the received content is HTML
6. OnXML - called right after OnHTML if the received content is HTML or XML
7. OnScraped - called after the OnXML callbacks
Colly is a highly customizable scraping framework. It has sane defaults and provides plenty of options to change them.
The full list of collector attributes can be found in the package documentation. The recommended way to initialize a collector is to use colly.NewCollector(options...).

Create a collector with default settings:
c1 := colly.NewCollector()
Create another collector and change the User-Agent and URL revisit options:
c2 := colly.NewCollector(
	colly.UserAgent("xy"),
	colly.AllowURLRevisit(),
)
or
c2 := colly.NewCollector()
c2.UserAgent = "xy"
c2.AllowURLRevisit = true
Configuration can be changed at any point of a scraping job by overriding the collector's attributes. A nice example is a User-Agent switcher which changes the User-Agent on every request:
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

func RandomString() string {
	b := make([]byte, rand.Intn(10)+10)
	for i := range b {
		b[i] = letterBytes[rand.Intn(len(letterBytes))]
	}
	return string(b)
}

c := colly.NewCollector()
c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", RandomString())
})
The collector's default configuration can also be changed via environment variables. This lets us fine-tune collectors without recompiling. Environment parsing is the last step of collector initialization, so every configuration change made after initialization overrides the configuration parsed from the environment.
The recognized environment variables:

COLLY_ALLOWED_DOMAINS (comma separated list of domains)
COLLY_CACHE_DIR (string)
COLLY_DETECT_CHARSET (y/n)
COLLY_DISABLE_COOKIES (y/n)
COLLY_DISALLOWED_DOMAINS (comma separated list of domains)
COLLY_IGNORE_ROBOTSTXT (y/n)
COLLY_FOLLOW_REDIRECTS (y/n)
COLLY_MAX_BODY_SIZE (int)
COLLY_MAX_DEPTH (int - 0 means infinite)
COLLY_PARSE_HTTP_ERROR_RESPONSE (y/n)
COLLY_USER_AGENT (string)

Colly uses Golang's default HTTP client as the networking layer. HTTP options can be tweaked by changing the default transport:
c := colly.NewCollector()
c.WithTransport(&http.Transport{
	Proxy: http.ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		DualStack: true,
	}).DialContext,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
})
Sometimes it's enough to place some log.Println() calls in your callbacks, but sometimes it isn't. Colly has a built-in ability to debug collectors: it provides a debugger interface and different kinds of debugger implementations. Attaching a basic logging debugger requires the debug package (github.com/gocolly/colly/debug) from Colly's repo.
import (
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	c := colly.NewCollector(colly.Debugger(&debug.LogDebugger{}))
	// [..]
}
You can create any kind of custom debugger by implementing the debug.Debugger interface.
Distributed scraping can be implemented in different ways depending on the requirements of the scraping task. Most of the time it's enough to scale the network communication level, which can easily be achieved using proxies and Colly's proxy switchers.
Using proxy switchers, scraping remains centralized while the HTTP requests are distributed among multiple proxies. Colly supports proxy switching via its SetProxyFunc member. Any custom function can be passed to SetProxyFunc() as long as it has the signature func(*http.Request) (*url.URL, error).

Colly has a built-in proxy switcher which rotates through a list of proxies on every request.
package main

import (
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()
	if p, err := proxy.RoundRobinProxySwitcher(
		"socks5://127.0.0.1:1337",
		"socks5://127.0.0.1:1338",
		"http://127.0.0.1:8080",
	); err == nil {
		c.SetProxyFunc(p)
	}
	// ...
}
Implementing a custom proxy switcher:
var proxies []*url.URL = []*url.URL{
	{Host: "127.0.0.1:8080"},
	{Host: "127.0.0.1:8081"},
}

func randomProxySwitcher(_ *http.Request) (*url.URL, error) {
	// rand is math/rand
	return proxies[rand.Intn(len(proxies))], nil
}

// ...
c.SetProxyFunc(randomProxySwitcher)
For managing independent and distributed scrapers, the best you can do is wrap the scraper in a server. The server can be any kind of service, like an HTTP or TCP server, or a Google App Engine application. Use custom storage to achieve centralized and persistent cookie and visited-URL handling. An example implementation can be found among the examples at the end of this article.
Visited URLs and cookie data are stored in memory by default. This is convenient for short-lived scraper jobs, but it can be a serious limitation when dealing with large-scale or long-running crawling jobs.
Colly has a built-in in-memory storage backend for cookies and visited URLs, but it can be replaced with any custom storage backend that implements the colly/storage.Storage interface. Use the collector's SetStorage() method to override the default; see the storage package for details.
It's recommended to use multiple collectors for one scraping job if the task is complex enough or has different kinds of subtasks. A good example is a two-collector setup where one collector parses the list views and handles paging, and the other collects course details.

Colly has some built-in methods to support the use of multiple collectors.
The collector's Clone() method can be used if the collectors have a similar configuration. Clone() duplicates a collector with identical configuration but without the attached callbacks.
c := colly.NewCollector(
	colly.UserAgent("myUserAgent"),
	colly.AllowedDomains("foo.com", "bar.com"),
)
// Custom User-Agent and allowed domains are cloned to c2
c2 := c.Clone()
Use the collector's Request() function to share context with other collectors. An example of sharing context:
c.OnResponse(func(r *colly.Response) {
	r.Ctx.Put("Custom-header", r.Headers.Get("Custom-Header"))
	c2.Request("GET", "https://foo.com/", nil, r.Ctx, nil)
})
Colly's default configuration is optimized for scraping a smaller number of sites in one job. This setup isn't the best if you'd like to scrape millions of sites. Here are some tweaks:

Use a persistent storage backend. By default, Colly stores cookies and visited URLs in memory. You can replace the built-in in-memory storage backend with any custom backend; see the storage section above for details.
Use async for recursive calls. By default, Colly blocks while a request is in progress, so recursively calling Collector.Visit from callbacks produces a constantly growing stack. Setting Collector.Async = true avoids this. (Don't forget to use c.Wait() together with async.)
Disable keep-alive for long jobs if needed. Colly uses HTTP keep-alive to enhance scraping speed, but it requires open file descriptors, so long-running jobs can easily reach the max-fd limit. HTTP keep-alive can be disabled with the following code:
c := colly.NewCollector()
c.WithTransport(&http.Transport{
	DisableKeepAlives: true,
})
Extensions are small helper utilities shipped with Colly. The list of plugins is available in the extensions package. The following example enables the random User-Agent switcher and the Referer setter extensions and visits httpbin.org twice:
import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	c := colly.NewCollector()
	visited := false

	extensions.RandomUserAgent(c)
	extensions.Referer(c)

	c.OnResponse(func(r *colly.Response) {
		log.Println(string(r.Body))
		if !visited {
			visited = true
			r.Request.Visit("/get?q=2")
		}
	})

	c.Visit("http://httpbin.org/get")
}
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Create a collector
	c := colly.NewCollector()

	// Set HTML callback
	// Won't be called if error occurs
	c.OnHTML("*", func(e *colly.HTMLElement) {
		fmt.Println(e)
	})

	// Set error handler
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
	})

	// Start scraping
	c.Visit("https://definitely-not-a.website/")
}
package main

import (
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// create a new collector
	c := colly.NewCollector()

	// authenticate
	err := c.Post("http://example.com/login", map[string]string{"username": "admin", "password": "admin"})
	if err != nil {
		log.Fatal(err)
	}

	// attach callbacks after login
	c.OnResponse(func(r *colly.Response) {
		log.Println("response received", r.StatusCode)
	})

	// start scraping
	c.Visit("https://example.com/")
}
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// MaxDepth is 1, so only the links on the scraped page
		// are visited, and no further links are followed
		colly.MaxDepth(1),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Println(link)
		// Visit link found on page
		e.Request.Visit(link)
	})

	// Start scraping on https://en.wikipedia.org
	c.Visit("https://en.wikipedia.org/")
}
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"time"

	"github.com/gocolly/colly"
)

func generateFormData() map[string][]byte {
	f, _ := os.Open("gocolly.jpg")
	defer f.Close()
	imgData, _ := ioutil.ReadAll(f)
	return map[string][]byte{
		"firstname": []byte("one"),
		"lastname":  []byte("two"),
		"email":     []byte("onetwo@example.com"),
		"file":      imgData,
	}
}

func setupServer() {
	var handler http.HandlerFunc = func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("received request")
		err := r.ParseMultipartForm(10000000)
		if err != nil {
			fmt.Println("server: Error")
			w.WriteHeader(500)
			w.Write([]byte("Internal Server Error"))
			return
		}
		w.WriteHeader(200)
		fmt.Println("server: OK")
		w.Write([]byte("Success"))
	}
	go http.ListenAndServe(":8080", handler)
}

func main() {
	// Start a single route http server to post an image to.
	setupServer()

	c := colly.NewCollector(colly.AllowURLRevisit(), colly.MaxDepth(5))

	// On every a element which has href attribute call callback
	c.OnHTML("html", func(e *colly.HTMLElement) {
		fmt.Println(e.Text)
		time.Sleep(1 * time.Second)
		e.Request.PostMultipart("http://localhost:8080/", generateFormData())
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Posting gocolly.jpg to", r.URL.String())
	})

	// Start scraping
	c.PostMultipart("http://localhost:8080/", generateFormData())
	c.Wait()
}
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// MaxDepth is 2, so only the links on the scraped page
		// and links on those pages are visited
		colly.MaxDepth(2),
		colly.Async(true),
	)

	// Limit the maximum parallelism to 2
	// This is necessary if the goroutines are dynamically
	// created to control the limit of simultaneous requests.
	//
	// Parallelism can be controlled also by spawning fixed
	// number of go routines.
	c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Println(link)
		// Visit link found on page on a new thread
		e.Request.Visit(link)
	})

	// Start scraping on https://en.wikipedia.org
	c.Visit("https://en.wikipedia.org/")
	// Wait until threads are finished
	c.Wait()
}
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Rotate two socks5 proxies
	rp, err := proxy.RoundRobinProxySwitcher("socks5://127.0.0.1:1337", "socks5://127.0.0.1:1338")
	if err != nil {
		log.Fatal(err)
	}
	c.SetProxyFunc(rp)

	// Print the response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// Fetch httpbin.org/ip five times
	for i := 0; i < 5; i++ {
		c.Visit("https://httpbin.org/ip")
	}
}
package main

import (
	"fmt"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/queue"
)

func main() {
	url := "https://httpbin.org/delay/1"

	// Instantiate default collector
	c := colly.NewCollector()

	// create a request queue with 2 consumer threads
	q, _ := queue.New(
		2, // Number of consumer threads
		&queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage
	)

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("visiting", r.URL)
	})

	for i := 0; i < 5; i++ {
		// Add URLs to the queue
		q.AddURL(fmt.Sprintf("%s?n=%d", url, i))
	}
	// Consume URLs
	q.Run(c)
}
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	url := "https://httpbin.org/delay/2"

	// Instantiate default collector
	c := colly.NewCollector(
		// Attach a debugger to the collector
		colly.Debugger(&debug.LogDebugger{}),
		colly.Async(true),
	)

	// Limit the number of threads started by colly to two
	// when visiting links whose domains match the "*httpbin.*" glob
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*httpbin.*",
		Parallelism: 2,
		RandomDelay: 5 * time.Second,
	})

	// Start scraping in four threads on https://httpbin.org/delay/2
	for i := 0; i < 4; i++ {
		c.Visit(fmt.Sprintf("%s?n=%d", url, i))
	}
	// Start scraping on https://httpbin.org/delay/2
	c.Visit(url)
	// Wait until threads are finished
	c.Wait()
}
package main

import (
	"fmt"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	url := "https://httpbin.org/delay/2"

	// Instantiate default collector
	c := colly.NewCollector(
		// Turn on asynchronous requests
		colly.Async(true),
		// Attach a debugger to the collector
		colly.Debugger(&debug.LogDebugger{}),
	)

	// Limit the number of threads started by colly to two
	// when visiting links whose domains match the "*httpbin.*" glob
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*httpbin.*",
		Parallelism: 2,
		//Delay:      5 * time.Second,
	})

	// Start scraping in five threads on https://httpbin.org/delay/2
	for i := 0; i < 5; i++ {
		c.Visit(fmt.Sprintf("%s?n=%d", url, i))
	}
	// Wait until threads are finished
	c.Wait()
}
package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/queue"
	"github.com/gocolly/redisstorage"
)

func main() {
	urls := []string{
		"http://httpbin.org/",
		"http://httpbin.org/ip",
		"http://httpbin.org/cookies/set?a=b&c=d",
		"http://httpbin.org/cookies",
	}

	c := colly.NewCollector()

	// create the redis storage
	storage := &redisstorage.Storage{
		Address:  "127.0.0.1:6379",
		Password: "",
		DB:       0,
		Prefix:   "httpbin_test",
	}

	// add storage to the collector
	err := c.SetStorage(storage)
	if err != nil {
		panic(err)
	}

	// delete previous data from storage
	if err := storage.Clear(); err != nil {
		log.Fatal(err)
	}

	// close redis client
	defer storage.Client.Close()

	// create a new request queue with redis storage backend
	q, _ := queue.New(2, storage)

	c.OnResponse(func(r *colly.Response) {
		log.Println("Cookies:", c.Cookies(r.Request.URL.String()))
	})

	// add URLs to the queue
	for _, u := range urls {
		q.AddURL(u)
	}
	// consume requests
	q.Run(c)
}
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector()

	// Before making a request put the URL with
	// the key of "url" into the context of the request
	c.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("url", r.URL.String())
	})

	// After making a request get "url" from
	// the context of the request
	c.OnResponse(func(r *colly.Response) {
		fmt.Println(r.Ctx.Get("url"))
	})

	// Start scraping on https://en.wikipedia.org
	c.Visit("https://en.wikipedia.org/")
}
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/gocolly/colly"
)

type pageInfo struct {
	StatusCode int
	Links      map[string]int
}

func handler(w http.ResponseWriter, r *http.Request) {
	URL := r.URL.Query().Get("url")
	if URL == "" {
		log.Println("missing URL argument")
		return
	}
	log.Println("visiting", URL)

	c := colly.NewCollector()

	p := &pageInfo{Links: make(map[string]int)}

	// count links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		if link != "" {
			p.Links[link]++
		}
	})

	// extract status code
	c.OnResponse(func(r *colly.Response) {
		log.Println("response received", r.StatusCode)
		p.StatusCode = r.StatusCode
	})
	c.OnError(func(r *colly.Response, err error) {
		log.Println("error:", r.StatusCode, err)
		p.StatusCode = r.StatusCode
	})

	c.Visit(URL)

	// dump results
	b, err := json.Marshal(p)
	if err != nil {
		log.Println("failed to serialize response:", err)
		return
	}
	w.Header().Add("Content-Type", "application/json")
	w.Write(b)
}

func main() {
	// example usage: curl -s 'http://127.0.0.1:7171/?url=http://go-colly.org/'
	addr := ":7171"

	http.HandleFunc("/", handler)

	log.Println("listening on", addr)
	log.Fatal(http.ListenAndServe(addr, nil))
}
package main

import (
	"fmt"
	"regexp"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only root url and urls which start with "e" or "h" on httpbin.org
		colly.URLFilters(
			regexp.MustCompile("http://httpbin\\.org/(|e.+)$"),
			regexp.MustCompile("http://httpbin\\.org/h.+"),
		),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are matched by any of the URLFilter regexps
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on http://httpbin.org
	c.Visit("http://httpbin.org/")
}