Web scraping with Golang
Web scraping is a technique for parsing the HTML output of a website. Most online bots rely on the same technique to gather the information they need about a particular website or page.
An XML parser can be used to parse an HTML page and extract the required information. However, jQuery-style selectors are a more convenient way to query an HTML page. So, in this tutorial we will use goquery, a jQuery-like library for Golang, to parse the HTML document.
Project Setup and dependencies
As mentioned above, we will use the jQuery-like goquery library as the parser. Fetch the library with the following command:
    go get github.com/PuerkitoBio/goquery
Create a file named webscraper.go and open it in your favorite text editor.
Web scraper code to get posts from a website
    package main

    import (
        // import standard libraries
        "fmt"
        "log"

        // import third party libraries
        "github.com/PuerkitoBio/goquery"
    )

    // postScrape fetches the home page and prints the title and link of each post.
    func postScrape() {
        doc, err := goquery.NewDocument("http://code2succeed.com")
        if err != nil {
            log.Fatal(err)
        }

        // use CSS selector found with the browser inspector
        // for each, use index and item
        doc.Find("#main article .entry-title").Each(func(index int, item *goquery.Selection) {
            title := item.Text()
            linkTag := item.Find("a")
            link, _ := linkTag.Attr("href")
            fmt.Printf("Post #%d: %s - %s\n", index, title, link)
        })
    }

    func main() {
        postScrape()
    }
Output
    Post #0: Getting started with ReactJs - http://www.code2succeed.com/getting-started-with-reactjs/
    Post #1: Intro to React - http://www.code2succeed.com/intro-to-react/
    Post #2: Caesar Decryption of string using javascript - http://www.code2succeed.com/caesar-decryption-of-string-using-javascript/
    Post #3: Caesar encryption of string using JavaScript - http://www.code2succeed.com/caesar-encryption-of-string-using-javascript/
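The links in this output happen to be absolute, but href attributes on other pages are often relative, and scraped titles may carry surrounding whitespace. As a minimal sketch using only the standard library, the hypothetical helper normalizePost below (not part of the scraper above) trims the title and resolves the link against the page URL. The sample title and href are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"strings"
)

// normalizePost (hypothetical helper) trims whitespace from a scraped title
// and resolves a possibly relative href against the URL of the scraped page.
func normalizePost(pageURL, title, href string) (string, string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", "", err
	}
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", "", err
	}
	// ResolveReference leaves absolute links unchanged and completes relative ones.
	return strings.TrimSpace(title), base.ResolveReference(ref).String(), nil
}

func main() {
	// Sample values, assumed for illustration only.
	title, link, err := normalizePost("http://code2succeed.com", "\n  Intro to React  \n", "/intro-to-react/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s - %s\n", title, link)
	// prints "Intro to React - http://code2succeed.com/intro-to-react/"
}
```

Inside the Each callback of the scraper, the values returned by item.Text() and linkTag.Attr("href") could be passed through a helper like this before printing.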
Stay tuned for more updates and tutorials!