Thursday 28 August 2014

Extract data from Web Scraping C#

I am MVC ASP.NET developer.

I have received the contents from any url, i.e. http, https etc. using WebRequest

class.

I have received all the content of that particular url. (for now I took

http://google.com)

My next step is to extract buttons, header, footer, colors, text etc.

Here is my code for now:

public ActionResult GetContent(UrlModel model) //model having a string URL
which is entered in a text box and method hits using submit button.
{
    //WebRequest request = WebRequest.Create(model.URL);

    WebRequest request = WebRequest.Create(model.URL);

    request.Credentials = CredentialCache.DefaultCredentials;

    WebResponse response = request.GetResponse();

    Stream dataStream = response.GetResponseStream();

    StreamReader reader = new StreamReader(dataStream);

    string responseFromServer = reader.ReadToEnd();
    ViewBag.Response = responseFromServer;

    reader.Close();
    response.Close();
    return View();
}

Can someone help me with writing the code ?

Also do suggest me with some techniques of data extraction in C#.



Source: http://stackoverflow.com/questions/21901162/extract-data-from-web-scraping-c-

sharp

Wednesday 27 August 2014

Scrapy, scraping price data from StubHub


I've been having a difficult time with this one.

I want to scrape all the prices listed for this Bruno Mars concert at the Hollywood Bowl so I can get the average price.

http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/

I've located the prices in the HTML and the xpath is pretty straightforward but I cannot get any values to return.

I think it has something to do with the content being generated via javascript or ajax but I can't figure out how to send the correct request to get the code to work.

Here's what I have:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from deeptix.items import DeeptixItem

class TicketSpider(BaseSpider):
    name = "deeptix"
    allowed_domains = ["stubhub.com"]
    start_urls = ["http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/"]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//div[contains(@class, "q_cont")]')
    items = []
    for site in sites:
        item = DeeptixItem()
        item['price'] = site.xpath('span[contains(@class, "q")]/text()').extract()
        items.append(item)
    return items

Any help would be greatly appreciated I've been struggling with this one for quite some time now. Thank you in advance!


Source: http://stackoverflow.com/questions/22770917/scrapy-scraping-price-data-from-stubhub

How do you scrape AJAX pages?


Overview:

All screen scraping first requires manual review of the page you want to extract

resources from. When dealing with AJAX you usually just need to analyze a bit more

than just simply the HTML.

When dealing with AJAX this just means that the value you want is not in the initial

HTML document that you requested, but that javascript will be exectued which asks the

server for the extra information you want.

You can therefore usually simply analyze the javascript and see which request the

javascript makes and just call this URL instead from the start.

Example:

Take this as an example, assume the page you want to scrape from has the following

script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
  {
  // Firefox, Opera 8.0+, Safari
  xmlHttp=new XMLHttpRequest();
  }
catch (e)
  {
  // Internet Explorer
  try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
  catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
  }
  xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
  xmlHttp.open("GET","time.asp",true);
  xmlHttp.send(null);
  }
</script>

Then all you need to do is instead do an HTTP request to time.asp of the same server

instead. Example from w3schools.


Sporce: http://stackoverflow.com/questions/260540/how-do-you-scrape-ajax-pages

Using Perl to scrape a website


I am interested in writing a perl script that goes to the following link and extracts

the number 1975: https://familysearch.org/search/collection/results#count=20&query=

%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego

%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace

%3AWhite&collection_id=2000219

That website is the amount of white men born in the year 1923 who live in San Diego

County, California in 1940. I am trying to do this in a loop structure to generalize

over multiple counties and birth years.

In the file, locations.txt, I put the list of counties, such as San Diego County.

The current code runs, but instead of the # 1975, it displays unknown. The number 1975

should be in $val\n.

I would very much appreciate any help!

#!/usr/bin/perl

use strict;

use LWP::Simple;

open(L, "locations26.txt");

my $url = 'https://familysearch.org/search/collection/results#count=20&query=

%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%

%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace

%3AWhite&collection_id=2000219';

open(O, ">out26.txt");
 my $oldh = select(O);
 $| = 1;
 select($oldh);
 while (my $location = <L>) {
     chomp($location);
     $location =~ s/ /+/g;
      foreach my $year (1923..1923) {
                 my $u = $url;
                 $u =~ s/%LOCATION%/$location/;
                 $u =~ s/%YEAR%/$year/;
                 #print "$u\n";
                 my $content = get($u);
                 my $val = 'unknown';
                 if ($content =~ / of .strong.([0-9,]+)..strong. /) {
                         $val = $1;
                 }
                 $val =~ s/,//g;
                 $location =~ s/\+/ /g;
                 print "'$location',$year,$val\n";
                 print O "'$location',$year,$val\n";
         }
     }

Update: API is not a viable solution. I have been in contact with the site developer.

The API does not apply to that part of the webpage. Hence, any solution pertaining to

JSON will not be applicbale.



Source: http://stackoverflow.com/questions/14654288/using-perl-to-scrape-a-website

Tuesday 26 August 2014

Data Scraping using php


Here is my code

    $ip=$_SERVER['REMOTE_ADDR'];

    $url=file_get_contents("http://whatismyipaddress.com/ip/$ip");

    preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);

    $isp=$output[1][2];

    $city=$output[9][2];

    $state=$output[8][2];

    $zipcode=$output[12][2];

    $country=$output[7][2];

    ?>
    <body>
    <table align="center">
    <tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
    <tr><td>City :</td><td><?php echo $city;?></td></tr>
    <tr><td>State :</td><td><?php echo $state;?></td></tr>
    <tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
    <tr><td>Country :</td><td><?php echo $country;?></td></tr>
    </table>
    </body>

How do I find out the ISP provider of a person viewing a PHP page?

Is it possible to use PHP to track or reveal it?

Error: http://i.imgur.com/LGWI8.png

Curl Scrapping

<?php
$curl_handle=curl_init();
curl_setopt( $curl_handle, CURLOPT_FOLLOWLOCATION, true );
$url='http://www.whatismyipaddress.com/ip/132.123.23.23';
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Your application name');
$query = curl_exec($curl_handle);

curl_close($curl_handle);
preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);
echo $query;
$isp=$output[1][2];

$city=$output[9][2];

$state=$output[8][2];

$zipcode=$output[12][2];

$country=$output[7][2];
?>
<body>
<table align="center">
<tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
<tr><td>City :</td><td><?php echo $city;?></td></tr>
<tr><td>State :</td><td><?php echo $state;?></td></tr>
<tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
<tr><td>Country :</td><td><?php echo $country;?></td></tr>
</table>
</body>

Error: http://i.imgur.com/FJIq6.png

What's is wrong with my code here? Any alternative code , that i can use here.

I am not able to scrape that data as described here. http://i.imgur.com/FJIq6.png

P.S. Please post full code. It would be easier for me to understand.



Source: http://stackoverflow.com/questions/10461088/data-scraping-using-php

PDF scraping using R


I have been using the XML package successfully for extracting HTML tables but want to extend to PDF's. From previous questions it does not appear that there is a simple R solution but wondered if there had been any recent developments

Failing that, is there some way in Python (in which I am a complete Novice) to obtain and manipulate pdfs so that I could finish the job off with the R XML package

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system



Source: http://stackoverflow.com/questions/7918718/pdf-scraping-using-r

Monday 25 August 2014

Php Scraping data from a website


I am very new to programming and need a little help with getting data from a website and passing it into my PHP script.

The website is http://www.birthdatabase.com/.

I would like to plug in a name (First and Last) and retrieve the result. I know you can query the site by passing the name in the URL, but I am having problems scraping the results.

http://www.birthdatabase.com/cgi-bin/query.pl?textfield=FIRST&textfield2=LAST&age=&affid=

I am using the file_get_contents($URL) function to get the page but need help after that. Specifically, I would like to scrape only the results from a certain state if there are multiple results for that name.



You need the awesome simple_html_dom class.

With this class you can query the webpage's DOM in a similar way to jQuery.

First include the class in your page, then get the page content with this snippet:

$html = file_get_html('http://www.birthdatabase.com/cgi-bin/query.pl?textfield=' . $first . '&textfield2=' . $last . '&age=&affid=');

Then you can use CSS selections to scrape your data (something like this):

$n = 0;
foreach($html->find('table tbody tr td div font b table tbody') as $element) {
    @$row[$n]['tr']  = $element->find('tr')->text;
    $n++;
}

// output your data
print_r($row);



Source: http://stackoverflow.com/questions/15601584/php-scraping-data-from-a-website

Obtaining reddit data

I am interested in obtaining data from different reddit subreddits. Does anyone know if there is a reddit/other api similar like twitter does to crawl all the pages?


Yes, reddit has an API that can be used for a variety of purposes such as data collection, automatic commenting bots, or even to assist in subreddit moderation.

There are a few places to discover information on reddit's API:

    github reddit wiki -- provides the overview and rules for using reddit's API (follow the rules)
    automatically generated API docs -- provides information on the requests needed to access most of the API endpoints
    /r/redditdev -- the reddit community dedicated to answering questions both about reddit's source code and about reddit's API

If there is a particular programming language you are already familiar with, you should check out the existing set of API wrappers for various languages. Despite my bias (I am the package maintainer) I am quite certain PRAW, for python, has support for the largest number of reddit API features.



Source: http://stackoverflow.com/questions/14322834/obtaining-reddit-data

Sunday 24 August 2014

Scraping data in dynamic sites

I'm trying to scrape data from our local government. What I want is address from kids adoption offices. Here, in Brazil, all adoptions go through the government. So I have the URL of one office, there are 2 or 3 thousands more. But if I can manage to get one, the others will be easy. I made many attempts, bellow I show three.

The problem could be related to a Javascript (Ajax maybe) that refresh the page.

Note: I am not a PHP developer.

First attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP GET 1</h1>';

echo ini_get("allow_url_fopen");
echo ini_get("allow_url_fopen");

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$html = file_get_contents($url);
var_dump($html);

echo '</body></html>';

// Output
// 11
// Warning:
file_get_contents(http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?
transacao=CONSULTA&vara=2673) [function.file-get-contents]: failed to open stream: HTTP
request failed! HTTP/1.1 404 Not Found in /home/rsl/www/sc01_get.php on line 14
// bool(false)

Second attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 3</h1>';

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");;

$html=@curl_exec($curl);

if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
}
else{
   echo '<br>begin HTML[';
    echo  $html;
   echo '<br>]end html ';
}
echo '</body></html>';

// Output
// 1

third attempt

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.windowsphone.com");

    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 5</h1>';

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");;

$html=@curl($curl);


if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
}
else{
    echo '<br>begin HTML[';
    echo  $html;
    echo '<br>]end html ';
}
echo '</body></html>';

// Output
// cURL error number:0
// cURL error:

If the pages are really ajax based meaning the information that you need to scrape is loaded or shown through javascript execution, you will need another approach. You would need to automate with a real browser. You can go the Selenium route which can be written in a number of languages or use CasperJS with Javascript as the programming language.



Source: http://stackoverflow.com/questions/24611046/scraping-data-in-dynamic-sites

Friday 22 August 2014

What is the right way of storing screen-scraping data?

i'm working on a web site. it is scraping product details(names, features, prices etc.) from various web sites, processing and displaying them. i'am considering to run update script on each day and keep data fresh.

    scrape data
    process them
    store on database
    read(from db) and display them

i'am already storing all the data in a sql schema but i'm not sure. After each update, all the old records are vanishing. if the scraped new data comes corrupted somehow, there is nothing to show.

so, is there any common way to archive the old data? which one is more convenient: seperate sql schemas or xml files? or something else?

Source: http://stackoverflow.com/questions/13686474/what-is-the-right-way-of-storing-screen-scraping-data

Scraping dynamic data

I am scraping profiles on ask.fm for a research question. The problem is that only the top most recent questions are viewable and I have to click "view more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of calling this 4 times before scraping it. I want the most recent 60 posts on the site. Python is preferable.

You could probably use selenium to browse to the website and click on the button/link a few times. You can get that here:

    https://pypi.python.org/pypi/selenium

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Web Scraping data from different sites


I am looking for a few ideas on how can I solve a design problem I'm going to be faced with building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem, matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If i scrape this from two different sites, I will encounter the situation where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then when I scrape site A or B I can, over time, build up a list of failures and store them in a list for each scraper so that it can know (if i find id=100 then i know that the firstname will be William etc). I can't help but feel this is a rubbish idea!

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Monday 11 August 2014

How Your Online Information is Stolen - The Art of Web Scraping and Data Harvesting

Web scraping, also known as web/internet harvesting involves the use of a computer program which is able to extract data from another program's display output. The main difference between standard parsing and web scraping is that in it, the output being scraped is meant for display to its human viewers instead of simply input to another program.

Therefore, it isn't generally document or structured for practical parsing. Generally web scraping will require that binary data be ignored - this usually means multimedia data or images - and then formatting the pieces that will confuse the desired goal - the text data. This means that in actually, optical character recognition software is a form of visual web scraper.

Usually a transfer of data occurring between two programs would utilize data structures designed to be processed automatically by computers, saving people from having to do this tedious job themselves. This usually involves formats and protocols with rigid structures that are therefore easy to parse, well documented, compact, and function to minimize duplication and ambiguity. In fact, they are so "computer-based" that they are generally not even readable by humans.

If human readability is desired, then the only automated way to accomplish this kind of a data transfer is by way of web scraping. At first, this was practiced in order to read the text data from the display screen of a computer. It was usually accomplished by reading the memory of the terminal via its auxiliary port, or through a connection between one computer's output port and another computer's input port.

It has therefore become a kind of way to parse the HTML text of web pages. The web scraping program is designed to process the text data that is of interest to the human reader, while identifying and removing any unwanted data, images, and formatting for the web design.

Though web scraping is often done for ethical reasons, it is frequently performed in order to swipe the data of "value" from another person or organization's website in order to apply it to someone else's - or to sabotage the original text altogether. Many efforts are now being put into place by webmasters in order to prevent this form of theft and vandalism.

Source:http://ezinearticles.com/?How-Your-Online-Information-is-Stolen---The-Art-of-Web-Scraping-and-Data-Harvesting&id=923976

Sunday 3 August 2014

Data Extraction - A Guideline to Use Scrapping Tools Effectively

So many people around the world do not have much knowledge about these scrapping tools. In their views, mining means extracting resources from the earth. In these internet technology days, the new mined resource is data. There are so many data mining software tools are available in the internet to extract specific data from the web. Every company in the world has been dealing with tons of data, managing and converting this data into a useful form is a real hectic work for them. If this right information is not available at the right time a company will lose valuable time to making strategic decisions on this accurate information.

This type of situation will break opportunities in the present competitive market. However, in these situations, the data extraction and data mining tools will help you to take the strategic decisions in right time to reach your goals in this competitive business. There are so many advantages with these tools that you can store customer information in a sequential manner, you can know the operations of your competitors, and also you can figure out your company performance. And it is a critical job to every company to have this information at fingertips when they need this information.

To survive in this competitive business world, this data extraction and data mining are critical in operations of the company. There is a powerful tool called Website scraper used in online digital mining. With this toll, you can filter the data in internet and retrieves the information for specific needs. This scrapping tool is used in various fields and types are numerous. Research, surveillance, and the harvesting of direct marketing leads is just a few ways the website scraper assists professionals in the workplace.

Screen scrapping tool is another tool which useful to extract the data from the web. This is much helpful when you work on the internet to mine data to your local hard disks. It provides a graphical interface allowing you to designate Universal Resource Locator, data elements to be extracted, and scripting logic to traverse pages and work with mined data. You can use this tool as periodical intervals. By using this tool, you can download the database in internet to you spread sheets. The important one in scrapping tools is Data mining software, it will extract the large amount of information from the web, and it will compare that date into a useful format. This tool is used in various sectors of business, especially, for those who are creating leads, budget establishing seeing the competitors charges and analysis the trends in online. With this tool, the information is gathered and immediately uses for your business needs.

Another best scrapping tool is e mailing scrapping tool, this tool crawls the public email addresses from various web sites. You can easily from a large mailing list with this tool. You can use these mailing lists to promote your product through online and proposals sending an offer for related business and many more to do. With this toll, you can find the targeted customers towards your product or potential business parents. This will allows you to expand your business in the online market.

There are so many well established and esteemed organizations are providing these features free of cost as the trial offer to customers. If you want permanent services, you need to pay nominal fees. You can download these services from their valuable web sites also.

Source:http://ezinearticles.com/?Data-Extraction---A-Guideline-to-Use-Scrapping-Tools-Effectively&id=3600918