When I saw the first post in David Lyford-Smith's excellent series on creating a template to perform audit sampling, and read that the series was to cover the use of Dynamic Arrays, VBA and traditional Excel formulae, I immediately saw an opportunity to promote my obsessive Power Query agenda.
Surely, the extraction of sample items from a list of transactions would be right up Power Query's street. As it turned out, about 98% of the process was pretty straightforward, but the slightly strange way in which random number functions work in Power Query threatened to dent my previously unshakeable belief in the power of Power Query to do pretty much anything.
David has set out the problem to be solved in his three posts:
Excel Tip of the Week #353 - Audit sampling templates: Dynamic arrays
Excel Tip of the Week #356 - Audit sampling templates: Traditional formulas
Excel Tip of the Week #359 - Audit sampling templates: VBA
The task is to generate a list of items, where the number of items is specified by the user, drawn randomly from a table of transactions. If we just needed to select a given number of random items, this would be pretty straightforward (although, in Power Query, perhaps not so straightforward as you might have first thought).
However, David had set a more complicated task: the use of monetary unit sampling. Not being a statistician, and being more interested in the mechanics of the method, I chose to try and approximate the method David had used in part 1 of the series: to select random values between 0 and the total value of all the transactions, and then find which item each random value falls within when the transaction values are arranged cumulatively. To take a simplified example, if we had the following table:
A - 5
B - 12
C - 20
Then we would generate random values between 0 and 37. If a random value fell between 0 and 5, we would select A; if it was over 5 and up to 17, we would select B; and if it was over 17 and up to 37, we would select C.
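To illustrate that cumulative logic, here is a minimal sketch in M using the three-item table above (the names are invented for this illustration, and the final queries below are built differently):

let
    // simplified example: three items with values 5, 12 and 20
    Labels = {"A", "B", "C"},
    Values = {5, 12, 20},
    Total = List.Sum(Values),    // 37
    // cumulative upper bounds for each item: {5, 17, 37}
    Cumulative = List.Transform({1..List.Count(Values)},
        each List.Sum(List.FirstN(Values, _))),
    // draw a random value between 0 and the total
    Draw = Number.RandomBetween(0, Total),
    // pick the first item whose cumulative bound covers the draw
    Position = List.PositionOf(Cumulative,
        List.First(List.Select(Cumulative, each _ >= Draw))),
    Chosen = Labels{Position}
in
    Chosen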
Taking the data from the sample template supplied by David, we will first enable the variable fields to be used within Power Query. We have allocated the Range Name SampleSize to cell B3, and SampleUnit to cell D4. Using a Range Name in this way means that we can use From Table/Range without turning our cell into part of a Table and thereby creating an unnecessary header row.
In order to make these easily available in Power Query we will click on each in turn and use From Table/Range to read them into the editor as a table (albeit a very small one). We can then right-click on the value itself in the editor and choose Drill Down. This creates a single value that we can refer to directly in any other query just using its name. We then choose Close & Load to…, Connection only as we don't need to load the value back to the worksheet:
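For reference, the query that this sequence of clicks generates for SampleSize should look something like this (the exact step names may differ slightly from what the editor produces for you):

let
    Source = Excel.CurrentWorkbook(){[Name="SampleSize"]}[Content],
    Column1 = Source{0}[Column1]
in
    Column1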
Repeating this process for the SampleUnit cell creates our two variables within Power Query.
Next, we read the table containing the unique Item Identifier and Value columns into Power Query.
First, we'll create the simpler, item-based, sample. To do this, we just need to allocate a random number to each row of our table, then sort by the random number and keep the number of rows defined by SampleSize.
In the previous paragraph we referred to 'just' needing to allocate a random number to each row of our table. Actually, this is not as simple as might be imagined. Power Query has three functions that generate random numbers:
Number.Random() – generates a random number between 0 and 1
Number.RandomBetween() – generates a random number between a specified bottom and top number
List.Random() – generates a list of a specified length containing random numbers between 0 and 1
You might just think we need to add a custom column containing the formula =Number.Random(). This reveals one of the great differences between Excel and Power Query. Excel works with individual cells; Power Query works with tables and columns. Because the formula is evaluated once for the column as a whole, rather than cell by cell, every row receives the same random number. This is our new column:
We can force Power Query to recalculate the number for each row by, for example, adding an index column before we add our random number column:
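As a sketch, assuming the default step names the editor generates, those two steps might look like this:

= Table.AddIndexColumn(#"Changed Type", "Index", 1, 1)
= Table.AddColumn(#"Added Index", "Random", each Number.Random(), type number)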
Alternatively, we can use List.Random(1) to create a new column with a list of 1 random number in it, then extract the single value from that list and change its type:
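Those three steps (add the list, extract the value, change the type) can also be collapsed into a single Add Custom Column formula along these lines (the step name here is an assumption):

= Table.AddColumn(#"Changed Type", "Random", each List.Random(1){0}, type number)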
Whichever method we choose, the strangeness of using random numbers in Power Query is not over. Each further step triggers the random numbers to be recalculated, much as random numbers are recalculated in Excel whenever a general recalculation takes place. To create our list of random items, we need to sort by our random column and then keep the first rows, with the number of rows to keep set by our SampleSize value:
= Table.FirstN(#"Sorted Rows",SampleSize)
We can then remove the random column so we just load our Item identifier and Value columns to our worksheet.
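Assuming the default step name from the previous Keep Top Rows action, that final step would be something like:

= Table.RemoveColumns(#"Kept First Rows", {"Random"})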
If you look carefully, you will see that each of these stages includes a different set of items. This has important consequences when we come to switch to monetary unit sampling.
The big difference between item sampling and monetary unit sampling is that, instead of randomly choosing discrete items, we randomly choose a value contained somewhere in our total of all values. Because higher-value items contain more monetary units, they are more likely to be chosen. In our example, the total of all our values is £535,329.15. If we were to use £1 as our sampling unit, this would mean generating random values for approximately 535,329 rows.
To replicate this approach in Power Query, we add a custom column that uses the List.Numbers() function to create a list of values on each row of our table. Each list will start at 1 and end at the Value on that row, divided by the sampling unit. So, if our value was 7,848.96 and our sampling unit was 1, we would generate a sequential list of 7,849 items for that row:
List.Numbers(1,([Value]/SampleUnit))
Having added our custom column, we can click on the expand icon and choose to Expand to New Rows. Our table now includes 535,382 rows – you could get closer to the exact sum of the values by rounding the Value in the List.Numbers() formula:
List.Numbers(1,(Number.Round([Value])/SampleUnit))
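For completeness, the expand step that the UI generates should look something like this, assuming the custom column kept its default name of Custom:

= Table.ExpandListColumn(#"Added Custom", "Custom")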
We can now add our random column as we did for the item sampling but, of course, this time we will be adding random numbers to over 500,000 rows.
This is where things get a little more difficult. We want to select our required number of sample items by sorting our random column and then choosing the items in which our random numbers are found. However, it is perfectly possible that the same item will feature more than once in our sample list, so we will need to remove duplicates to get the number of items that we need. If we just right-click on the Item identifier header and choose Remove Duplicates, we will be back to our original 100 rows. Crucially, however, just as in our item sample, the action will recalculate the random numbers and accordingly re-sort our list, so we won't end up with the items that our previous sort had identified.

This is where the method I chose is open to question, and I would be grateful for any comments as to whether it is still statistically valid. I used Transform, Group By to group the items by Item identifier and Value, and then aggregate the Max of the Random column:
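The underlying step, with the name of the preceding step assumed, looks something like this:

= Table.Group(#"Added Random", {"Item identifier", "Value"}, {{"Random", each List.Max([Random]), type number}})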
Although all the subsequent operations still cause the sample to be recalculated, it seems that the recalculation involves working with the entire population of 500,000+ rows, rather than just the 100 unique rows. Repeatedly refreshing the queries and comparing the Monetary Unit sample with the Item sample certainly suggests that the Monetary Unit sample favours the higher value items:
This example did challenge both my knowledge of statistics and my understanding of how Power Query deals with random numbers. In part 2, we will look at trying to create a more elegant and scalable solution. We will also be inviting you to point out any flaws in the methods covered and to suggest alternative and better solutions.