Plenty of people have been scraping data from the web with R for a while now, but I just completed my first project and wanted to share the code with you. It was a little tricky to work through some of the "issues", but I had some great help from @DataJunkie on Twitter.
As an aside, if you are learning R and coming from another package like SPSS or SAS, I highly advise that you follow the hashtag #rstats on Twitter to be amazed by the kinds of data analysis that are going on right now.
One note: when I read in my table, it contained a weird set of characters. I suspect it is some sort of encoding issue, but luckily I was able to work around it by recoding the data from character factors to numeric values using the stringr package and some basic regular expressions (a quick sketch of the idea follows; the full script below applies it column by column).
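Here is a minimal, self-contained sketch of that recoding step, assuming a made-up vector of values (the raw values and the stray leading "xx" characters are hypothetical stand-ins for whatever the encoding junk actually is):
library(stringr)
# hypothetical example: numbers read in as factors with stray characters in front
raw <- factor(c("xx318", "xx-4", "xx76.5"))
# factor -> character, pull out the first signed number, then coerce to numeric
clean <- as.numeric(str_match(as.character(raw), "-?\\d{1,3}\\.?[0-9]*"))
clean
# [1] 318.0  -4.0  76.5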
Bring on fantasy football!
################################################################
## Help from the following sources:
## @DataJunkie on twitter
## http://www.regular-expressions.info/reference.html
## http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package
## http://stackoverflow.com/questions/2443127/how-can-i-use-r-rcurl-xml-packages-to-scrape-this-webpage
################################################################
library(XML)
library(stringr)
# build the URL
url <- paste("http://sports.yahoo.com/nfl/stats/byposition?pos=QB",
             "&conference=NFL&year=season_2009",
             "&timeframe=Week1", sep="")
# read all the tables on the page and preview the one with the most rows
tables <- readHTMLTable(url)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
# select the table we need (number 7 on this page) - it comes in as a data frame
my.table <- tables[[7]]
# delete extra columns and keep data rows
View(head(my.table, n=20))
my.table <- my.table[3:nrow(my.table), c(1:3, 5:12, 14:18, 20:21, 23:24) ]
# rename every column
c.names <- c("Name", "Team", "G", "QBRat", "P_Comp", "P_Att", "P_Yds", "P_YpA", "P_Lng", "P_Int", "P_TD", "R_Att",
             "R_Yds", "R_YpA", "R_Lng", "R_TD", "S_Sack", "S_SackYa", "F_Fum", "F_FumL")
names(my.table) <- c.names
# the data get read in with weird symbols that need to be removed - columns start out as character factors
# for the loops below, I am manually telling the code which regex to use - this assumes consistent behavior
# depending on where the weird characters appear -- is this an encoding issue?
front <- c(1)
back <- c(4:ncol(my.table))
for(f in front) {
  # the name column has two junk characters at the start - drop them
  test.front <- as.character(my.table[, f])
  tt.front <- str_sub(test.front, start=3)
  my.table[, f] <- tt.front
}
for(b in back) {
  test <- as.character(my.table[, b])
  # keep just the (possibly negative, possibly decimal) number in each cell and coerce to numeric
  tt.back <- as.numeric(str_match(test, "-?\\d{1,3}\\.?[0-9]*"))
  my.table[, b] <- tt.back
}
str(my.table)
View(my.table)
# clear the workspace and quit R (the stray "n" answers the save-workspace prompt)
rm(list=ls())
q()
n
Source: http://www.r-bloggers.com/scrape-web-data-using-r/