The last exercise in the Go Tour, parallelizing a web crawler, turned out to be quite a bit more interesting than I'd expected. If anyone has suggested improvements that I can learn from, or has posted their own solution, let me know. My exercise solution is on github. I've tried to stick to the tour content (i.e. using only channels, rather than the sync package, for accessing shared data).
Spoiler Alert: If you are learning Go and haven't yet worked through the Go Tour, go and do so now. If you get stuck, keep struggling, take a break, try again in a few days and so on, before looking at other people's solutions.
The solution I ended up with has a Crawl() function very similar to the original, just with two extra parameters:
func Crawl(url string, depth int, fetcher Fetcher,
    startCrawl func(string) bool, crawlComplete chan string) {
    if depth <= 0 {
        crawlComplete <- url
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        crawlComplete <- url
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        if startCrawl(u) {
            go Crawl(u, depth-1, fetcher, startCrawl, crawlComplete)
        }
    }
    crawlComplete <- url
}
The two parameters are:
- startCrawl func(url string) bool – used as a check before spawning a new ‘go Crawl(url)’ to ensure that we don’t crawl the same url twice.
- crawlComplete chan string – used to signal that the Crawl function has fetched the page and finished spawning any child goroutines.
These two resources are created and passed in to the initial Crawl() call in the main() function:
func main() {
    startCrawl := make(chan StartCrawlData)
    crawlComplete := make(chan string)
    quitNow := make(chan bool)
    go processCrawls(startCrawl, crawlComplete, quitNow)

    // Returns whether a crawl should be started for a given
    // URL.
    startCrawlFn := func(url string) bool {
        resultChan := make(chan bool)
        startCrawl <- StartCrawlData{url, resultChan}
        return <-resultChan
    }

    Crawl("http://golang.org/", 4, fetcher, startCrawlFn,
        crawlComplete)
    <-quitNow
}
Access to the shared state (which urls have been crawled, and whether every Crawl() call has finished) is managed via those channels in the processCrawls() goroutine, so main() can simply call the first Crawl() and then wait to quit. I still want to check how cheap it is to create a temporary channel on every call (for the return value of startCrawlFn above). I think I saw this method in an earlier Go tutorial example. Otherwise I'm happy with the solution.
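For readers who don't want to dig through the repository, here is a minimal sketch of how a processCrawls() goroutine of this kind could work. It is not necessarily the code in my github solution: the StartCrawlData field names, the seen map and the active counter are all assumptions made for illustration. Note one simplification: the root url is never recorded as seen, because main() calls the first Crawl() directly, so a child page linking back to it could trigger a second fetch.

```go
package main

import "fmt"

// StartCrawlData pairs a URL with a reply channel.
// The field names are assumed; the post only shows a positional literal.
type StartCrawlData struct {
	url    string
	result chan bool
}

// processCrawls owns all shared state, so no locking is needed.
// It approves each URL at most once and counts crawls in flight.
func processCrawls(startCrawl chan StartCrawlData,
	crawlComplete chan string, quitNow chan bool) {
	seen := make(map[string]bool) // URLs already approved
	active := 1                   // the initial Crawl() made by main()
	for active > 0 {
		select {
		case req := <-startCrawl:
			if seen[req.url] {
				req.result <- false
			} else {
				seen[req.url] = true
				active++ // counted before the goroutine is spawned
				req.result <- true
			}
		case <-crawlComplete:
			active--
		}
	}
	quitNow <- true
}

func main() {
	startCrawl := make(chan StartCrawlData)
	crawlComplete := make(chan string)
	quitNow := make(chan bool)
	go processCrawls(startCrawl, crawlComplete, quitNow)

	ask := func(url string) bool {
		result := make(chan bool)
		startCrawl <- StartCrawlData{url, result}
		return <-result
	}

	fmt.Println(ask("a"))   // true: first request for "a"
	fmt.Println(ask("a"))   // false: "a" was already approved
	crawlComplete <- "a"    // the child crawl finishes
	crawlComplete <- "root" // the initial crawl finishes
	<-quitNow
	fmt.Println("done")
}
```

The counter cannot reach zero early, because a parent only sends on crawlComplete after every startCrawl() call for its children has returned, and each approval has already incremented the counter by then.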
Other solutions to learn from:
- Sonia Codes’ solution
- [I'll add more as I find them, now that I've finished my own solution]
Filed under: golang