Monday 22 December 2014

Scrape Web data using R

Plenty of people have been scraping data from the web using R for a while now, but I just completed my first project and I wanted to share the code with you.  It was a little hard to work through some of the “issues”, but I had some great help from @DataJunkie on twitter.

As an aside, if you are learning R and coming from another package like SPSS or SAS, I highly advise that you follow the hashtag #rstats on Twitter to be amazed by the kinds of data analysis that are going on right now.

One note.  When I read in my table, it contained a wierd set of characters.  I suspect that it is some sort of encoding, but luckily, I was able to get around it by recoding the data from a character factor to a number by using the stringr package and some basic regex expressions.

Bring on fantasy football!

################################################################

## Help from the followingn sources:

## @DataJunkie on twitter

## http://www.regular-expressions.info/reference.html

## http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package

## http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package

## http://stackoverflow.com/questions/2443127/how-can-i-use-r-rcurl-xml-packages-to-scrape-this-webpage

################################################################

library(XML)

library(stringr)

# build the URL

url <- paste("http://sports.yahoo.com/nfl/stats/byposition?pos=QB",

        "&conference=NFL&year=season_2009",
        "&timeframe=Week1", sep="")

# read the tables and select the one that has the most rows

tables <- readHTMLTable(url)

n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

tables[[which.max(n.rows)]]

# select the table we need - read as a dataframe

my.table <- tables[[7]]

# delete extra columns and keep data rows

View(head(my.table, n=20))

my.table <- my.table[3:nrow(my.table), c(1:3, 5:12, 14:18, 20:21, 23:24) ]

# rename every column

c.names <- c("Name", "Team", "G", "QBRat", "P_Comp", "P_Att", "P_Yds", "P_YpA", "P_Lng", "P_Int", "P_TD", "R_Att",

        "R_Yds", "R_YpA", "R_Lng", "R_TD", "S_Sack", "S_SackYa", "F_Fum", "F_FumL")

names(my.table) <- c.names

# data get read in with wierd symbols - need to remove - initially stored as character factors

# for the loops, I am manually telling the code which regex to use - assumes constant behavior

# depending on where the wierd characters are -- is this an encoding?

front <- c(1)

back <- c(4:ncol(my.table))

for(f in front) {

    test.front <- as.character(my.table[, f])

    tt.front <- str_sub(test.front, start=3)

    my.table[,f] <- tt.front

}

for(b in back) {

    test <- as.character(my.table[ ,b])

    tt.back <- as.numeric(str_match(test, "\-*\d{1,3}[\.]*[0-9]*"))

    my.table[, b] <- tt.back
}

str(my.table)

View(my.table)

# clear memory and quit R

rm(list=ls())

q()

n

Source: http://www.r-bloggers.com/scrape-web-data-using-r/

3 comments:

  1. Hi,

    i am new to web scrapping, can you please help me to get the amazon reviews for a range of product,
    i have below questions :
    1. how to scrape the data from multiple screens (some pages have pagination like 1/2/3/4/...), i need the data from all the pages
    2. do we need to have the url in a particular format ?
    i am using
    http://www.amazon.in/s/ref=nb_sb_ss_c_1_10?url=search-alias%3Daps&field-keywords=patanjali+products&sprefix=patanjali+%2Caps%2C463&crid=1N5K8O67WXIIG

    3. can i store all the data in different fields of table, like reviews in one field, product id in another, rating in another fields, and likewise?

    thanks...
    Ankit J

    ReplyDelete
  2. Thanks for sharing such beautiful information with us. I hope you will share some more info about.
    yelp scraping

    ReplyDelete
  3. BestProductBuff provides you the best listing stuff from food, beauty, health care, automotive, electronics, and sports product to appliances, computers, and phone accessories, for more info visit:
    bestproductbuff.com

    ReplyDelete