Saturday, 26 April 2014

How to scrape using OutWit Hub

OutWit Hub is a great tool that lets non-programmers get data from webpages. Let's have a look at how to use it.

First grab yourself a free version of the software by downloading it from here.


Next, choose a page that you want to extract data from. OutWit is handy because it can take data from multiple pages that share the same format.

For this example, I am going to use a job search on journalism.co.uk.

First, do a search by any criteria. I'm going to look at jobs in England because this returns a nice chunk of results.

Then paste the URL of the results page into OutWit's browser.

Now click on the Scrapers tab on the left side of the page, and click 'new' to build a new scraper.

A word of warning: OutWit works by grabbing the text you tell it to. To do that, you have to tell it what comes directly before and directly after that text in the page's code, which means you're going to have to get into the code. But don't worry, you don't need to know what it means.
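If you're curious what 'before and after' means in practice, here is a minimal Python sketch of the idea. It is not OutWit's actual code, just an illustration of grabbing whatever sits between two markers. The function name extract_between and the sample page code are made up for this example.

```python
import re


def extract_between(source, before, after):
    """Return every piece of text that sits between the two markers."""
    # re.escape makes the markers match literally; (.*?) grabs as little as possible.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return [match.strip() for match in re.findall(pattern, source, flags=re.DOTALL)]


# Hypothetical page code, invented for illustration.
page = '<h3 class="job-title">Senior Account Manager</h3><h3 class="job-title">Reporter</h3>'

titles = extract_between(page, '<h3 class="job-title">', "</h3>")
print(titles)  # ['Senior Account Manager', 'Reporter']
```

The same pair of markers picks up every job title on the page, which is why one scraper row can return a whole column of results.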

To build your scraper, you need to decide what you want to get off the page. I'm going to get the job title, salary and description. Enter your categories into your scraper by clicking on the rows underneath where it says 'description' in the bottom half of the page. It will look like this:



Next you have to choose your markers: the bits of code that come directly before and after each piece of text you want. To find them, look at the code. The code lives in the central window, so expand that. It will look like this:


I know it looks scary, but we have a secret weapon: the 'find in page' function. This will help us locate the stuff we want within the code. So let's look for the first job title. In this instance it is 'Senior Account Manager', so type that into the box.

Tip: Text that appears on the webpage is in black in Outwit.

Now copy the code that comes immediately before the job title and the code that comes immediately after it, and paste both into your scraper as the markers for that category.

Tip: Trial and error is a part of using Outwit. If your markers aren't giving you what you want, you might have to try being more (or less) specific.
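To see why specificity matters, here is a small Python sketch comparing a vague marker with a more specific one. The page code and the extract_between helper are hypothetical, invented for this example, and are not how OutWit itself is written.

```python
import re


def extract_between(source, before, after):
    """Return every piece of text that sits between the two markers."""
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return re.findall(pattern, source, flags=re.DOTALL)


# Hypothetical page code: one ordinary heading followed by two job titles.
page = (
    "<h3>Search results</h3>"
    '<h3 class="job-title">Senior Account Manager</h3>'
    '<h3 class="job-title">Reporter</h3>'
)

# A vague marker also catches the heading, which isn't a job.
too_broad = extract_between(page, ">", "</h3>")
print(too_broad)  # ['Search results', 'Senior Account Manager', 'Reporter']

# A more specific marker only matches the job titles.
specific = extract_between(page, '<h3 class="job-title">', "</h3>")
print(specific)  # ['Senior Account Manager', 'Reporter']
```

Going too far the other way has the opposite problem: a marker that includes something unique to one job will only ever match that one job.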

Repeat this for each of your categories. Your scraper will now look like this:



Now click the 'execute' button to test it out. I've chucked in the URL for good measure so I can follow up these jobs if I spot a good one. Plus, it's easy to scrape because it comes in the code right after the job title.
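Roughly speaking, executing a scraper means applying each pair of markers across the page and lining the results up into rows. Here is a Python sketch of that idea under some loud assumptions: the markers, the page code and the run_scraper function are all invented for illustration, and OutWit's real internals may work differently.

```python
import re

# Hypothetical scraper: category -> (marker before, marker after).
SCRAPER = {
    "title": ('<h3 class="job-title">', "</h3>"),
    "url": ('href="', '"'),
    "salary": ('<span class="salary">', "</span>"),
    "description": ('<p class="summary">', "</p>"),
}

# Hypothetical page code containing two job listings.
PAGE = (
    '<h3 class="job-title">Senior Account Manager</h3><a href="/job/1">More</a>'
    '<span class="salary">GBP 30,000</span><p class="summary">Manage client accounts.</p>'
    '<h3 class="job-title">Reporter</h3><a href="/job/2">More</a>'
    '<span class="salary">GBP 22,000</span><p class="summary">Cover local news.</p>'
)


def run_scraper(source, scraper):
    """Collect one column per category, then zip the columns into rows."""
    columns = {}
    for name, (before, after) in scraper.items():
        pattern = re.escape(before) + r"(.*?)" + re.escape(after)
        columns[name] = [m.strip() for m in re.findall(pattern, source, flags=re.DOTALL)]
    # zip stops at the shortest column, so a missing value can shift later rows.
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


rows = run_scraper(PAGE, SCRAPER)
for row in rows:
    print(row["title"], row["url"], row["salary"], row["description"])
```

Note that in this sketch a job with no salary would push every later salary up a row, so the columns would no longer line up with the right job titles.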

You should get a page of results that looks like this:

Now check that your results line up by going back to the original webpage and confirming that each job title sits alongside the right salary and description.

Click on 'export' if everything looks good, and there you have it: a spreadsheet full of jobs, AKA your new to-do list.
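For anyone who wants to see what such an export boils down to, here is a minimal Python sketch that turns rows of scraped data into CSV text, a format any spreadsheet program can open. The rows_to_csv function and the job data are hypothetical, and this says nothing about which formats OutWit's own export produces.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Turn a list of dictionaries into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped rows, invented for illustration.
jobs = [
    {"title": "Senior Account Manager", "salary": "GBP 30,000", "description": "Manage client accounts."},
    {"title": "Reporter", "salary": "GBP 22,000", "description": "Cover local news."},
]

csv_text = rows_to_csv(jobs, ["title", "salary", "description"])
print(csv_text)
```

The csv module puts quotes around any value that contains a comma, such as the salaries here, so the columns stay intact when the file is opened in a spreadsheet.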
