Excel’s Power Query (called Get & Transform since Excel 2016) is a great tool for building queries to get data from the web. Within a couple of minutes you can build a query that will pull data from a webpage and transform it into the desired format. This is great for getting data from a webpage that is updated frequently, as you will be able to easily refresh your query to pull the new data.

Remember, if you’re not using Excel 2016 or later, then you’ll need to install the Power Query add-in.

Data to Extract

In this post we’re going to take a look at how we can pull data from a series of similar pages. I’m a big MMA fan, so the example we’re going to look at is getting a list of all UFC results from Wikipedia.

If you visit the Wikipedia page for UFC events, there’s a table of Past Events. If you click on one of the events, you’ll see a results table. If you look at a few more events, you’ll notice the structure is exactly the same and they all have a results table. This is the data I want to get, but from all 400+ events listed in the Past Events section. If the number of pages were any larger, you might be better off using another tool like Python, but we’re going to be using Power Query.

Create a Query Function

First, we will create a query to extract the data on one page. We will then turn this into a function query where the input is an event page URL. This way we can apply the query to each URL in a list of all the URLs.

Head to the Data tab in the ribbon and press the From Web button under the Get & Transform section. If you’re working with Excel 2013 or earlier via the add-in, then this will be found under the Power Query tab.

Enter the URL and press the OK button.

Excel will connect to the page and the Navigator dialog box will open.

  1. The tables available to import from the webpage will be listed. Select the Results table.
  2. A preview of our selected data will appear.
  3. Press the Edit button. This will open the Query Editor window.

Rename the query to fGetWikiResults. This will be the name we call to use our query function later on.

Now we can edit our query to turn it into a query function. Go to the View tab and press the Advanced Editor button. This will allow us to edit the code that Excel has created to extract the data from this URL.

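We will need to edit this code to the following. The parts that need to be added or changed are the GetResults function wrapper (the first line and the final in GetResults line) and the URL parameter inside Web.Contents, which replaces the hard-coded page address.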


let GetResults = (URL) =>

    let
        // Load the page at the URL passed in and list the tables it contains
        Source = Web.Page(Web.Contents(URL)),
        // Pick the results table by position (the table selected in the Navigator)
        Data1 = Source{1}[Data],
        // Set a data type for each column
        #"Changed Type" = Table.TransformColumnTypes(Data1,{{"Header", type text}, {"Weight class", type text}, {"", type text}, {"2", type text}, {"3", type text}, {"Method", type text}, {"Round", Int64.Type}, {"Time", type time}, {"Notes", type text}})
    in
        #"Changed Type"
in GetResults

Press the Done button when finished editing the query. This will turn our query into a parametrized query with the URL as an input.

You should see the data preview in the query editor has been replaced with a parameter input. We don’t need to enter anything here and we can just leave it blank.
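
If you would like to check the function before applying it to the whole list, one option is to call it from a blank query with a single event URL. This is only a minimal sketch: it assumes the function has been saved under the name fGetWikiResults and uses the UFC 217 page as the test address. TestUrl and Results are just illustrative step names.

```powerquery
let
    // Assumption: fGetWikiResults is the function query created in the steps above
    TestUrl = "https://en.wikipedia.org/wiki/UFC_217",
    // Call the function with one URL; the result should be that page's results table
    Results = fGetWikiResults(TestUrl)
in
    Results
```

Typing the same address into the parameter input and invoking the function should do much the same thing, so treat this as an optional sanity check rather than a required step.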

We can then save our query function by going to the Home tab and pressing the Close & Load button.

You should now see the fGetWikiResults query function in the Queries & Connections window.

Get a List of URLs

Now we will need to get our list of event page URLs from the Past Events page. We could use Power Query to import this table, but this would just pull in the text and not the underlying hyperlinks. The best way to get the list of URLs is to parse the source code of the page. You can view any webpage’s source code by pressing Ctrl + U in the Chrome browser.

You’ll need to be fairly familiar with HTML to find what you’re looking for. The first couple of rows of HTML we are interested in look like this. The hyperlinks we want are the href values in the second cell of each row, such as /wiki/UFC_217. You can parse these out in another Excel workbook using some filters and basic text formulas. We will also need to prepend the starting part of the address so that each link becomes a full URL (e.g. https://en.wikipedia.org/wiki/UFC_217).

<tr>
<td>416</td>
<td><a href="/wiki/UFC_217" title="UFC 217">UFC 217: Bisping vs. St-Pierre</a></td>
<td><span class="sortkey" style="display:none;speak:none">000000002017-11-04-0000</span><span style="white-space:nowrap">Nov 4, 2017</span></td>
<td><a href="/wiki/Madison_Square_Garden" title="Madison Square Garden">Madison Square Garden</a></td>
<td><a href="/wiki/New_York_City,_New_York" class="mw-redirect" title="New York City, New York">New York City, New York</a>, U.S.</td>
<td><span style="display:none" class="sortkey">7004182010000000000</span>18,201<sup id="cite_ref-21" class="reference"><a href="#cite_note-21">[21]</a></sup></td>
</tr>
<tr>
<td>415</td>
<td><a href="/wiki/UFC_Fight_Night:_Brunson_vs._Machida" title="UFC Fight Night: Brunson vs. Machida">UFC Fight Night: Brunson vs. Machida</a></td>
<td><span class="sortkey" style="display:none;speak:none">000000002017-10-28-0000</span><span style="white-space:nowrap">Oct 28, 2017</span></td>
<td><a href="/wiki/Gin%C3%A1sio_do_Ibirapuera" title="Ginásio do Ibirapuera">Ginásio do Ibirapuera</a></td>
<td><a href="/wiki/S%C3%A3o_Paulo" title="São Paulo">São Paulo</a>, Brazil</td>
<td><span style="display:none" class="sortkey">7004102650000000000</span>10,265<sup id="cite_ref-22" class="reference"><a href="#cite_note-22">[22]</a></sup></td>
</tr>
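
As an alternative to filters and text formulas, the same parsing could in principle be done in Power Query itself. The sketch below is not the method used in this post. It assumes the list page lives at https://en.wikipedia.org/wiki/List_of_UFC_events, that the page source still has one table cell per line as in the sample above, and that every event link starts with /wiki/UFC_ (events whose page names start differently would be missed). The step names are illustrative.

```powerquery
let
    // Assumptions: the site root and the address of the page holding the Past Events table
    BaseUrl = "https://en.wikipedia.org",
    ListPage = "https://en.wikipedia.org/wiki/List_of_UFC_events",
    // Download the raw page source as text rather than as parsed tables
    Html = Text.FromBinary(Web.Contents(ListPage), TextEncoding.Utf8),
    SourceLines = Lines.FromText(Html),
    // Keep only lines that look like the event cell in the sample rows
    EventLines = List.Select(SourceLines, each Text.StartsWith(_, "<td><a href=""/wiki/UFC_")),
    // Pull out the text between href=" and the next quote, e.g. /wiki/UFC_217
    Paths = List.Transform(EventLines, each Text.BetweenDelimiters(_, "href=""", """")),
    // Prepend the site root to build full URLs and drop any repeats
    Urls = List.Distinct(List.Transform(Paths, each BaseUrl & _)),
    // Return a one-column table headed URL
    Result = Table.FromList(Urls, Splitter.SplitByNothing(), {"URL"})
in
    Result
```

Because the filter is a simple text match, it is worth comparing the row count against the number of events on the page before relying on the output.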

Once we have the full list of event URLs, we can turn the list into an Excel Table using the Ctrl + T keyboard shortcut and name it URL_List.

Use the Query Function on our URL List

We are now ready to use the fGetWikiResults query function on our list of event URLs.

Create a query based on the URL_List table. Select a cell in the table and go to the Data tab in the ribbon and press the From Table/Range button under the Get & Transform section.

Now we will add a custom column to the query. This is where we’ll invoke our fGetWikiResults query function. Go to the Add Column tab and press the Custom Column button.

Enter a name in the New column name box (Results Data in this example) and then enter fGetWikiResults([URL]) in the Custom column formula box, where URL is the column heading in the URL_List table.

The new custom column will contain a Table for each URL, and we will need to expand this table to see the results. Left click on the expand icon (the double arrow) in the Results Data column heading. Select Expand from the menu and press the OK button.
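
For reference, the add-column and expand steps correspond roughly to the M code below. This is a hand-written sketch rather than the exact code Excel generates: the step names are illustrative, and it assumes the URL_List table has a single column headed URL and that the function returns the nine columns listed in its Changed Type step.

```powerquery
let
    // Load the Excel Table named URL_List from the current workbook
    Source = Excel.CurrentWorkbook(){[Name = "URL_List"]}[Content],
    Typed = Table.TransformColumnTypes(Source, {{"URL", type text}}),
    // Call the function query once per row; each cell of the new column holds a table
    AddedResults = Table.AddColumn(Typed, "Results Data", each fGetWikiResults([URL])),
    // Column names returned by fGetWikiResults (taken from its Changed Type step)
    ResultColumns = {"Header", "Weight class", "", "2", "3", "Method", "Round", "Time", "Notes"},
    // Expand the nested tables into ordinary columns, keeping the original names
    Expanded = Table.ExpandTableColumn(AddedResults, "Results Data", ResultColumns)
in
    Expanded
```

Table.ExpandTableColumn also accepts an optional fourth argument with new column names, which is one way to add a prefix or rename columns in the same step.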

Some of the column headings were missing in our source data, so we can rename them. Double-click on a column heading to rename it.
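
Renaming through the interface adds a Table.RenameColumns step to the query. The sketch below shows the idea on a one-row placeholder table so it can be pasted into a blank query on its own. The row values and the new names Winner, Result and Loser are my own assumptions for illustration; use whatever headings suit your data.

```powerquery
let
    // Placeholder table that mimics the blank and numbered headings from the results table
    Sample = #table(
        {"Weight class", "", "2", "3", "Method"},
        {{"Middleweight", "Fighter A", "def.", "Fighter B", "Decision"}}
    ),
    // Each pair is {old name, new name}; the new names here are hypothetical
    Renamed = Table.RenameColumns(Sample, {{"", "Winner"}, {"2", "Result"}, {"3", "Loser"}})
in
    Renamed
```

In the real query, the same Table.RenameColumns call would be applied to the expanded table instead of the placeholder.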

We can now Close & Load the query and the results data will load into a new sheet. This will take a good few minutes, since Power Query has to fetch every event page, so be patient. This is why you should probably start considering Python or similar tools if you have many more pages than this example.