Last month the Power BI team at Microsoft released an enhanced “combine binaries” experience, that I covered here last month. That new functionality allows you to easily combine multiple CSV files (and other file types) from a folder and utilize their filenames as a column in the final result (which is awesome – if you disagree, you don’t understand it yet, go read the post! ).
As I have committed to begin a series on CSVs, it’s time for the second part of the series. I think that today’s blog post will be quite helpful for many of you. So without further ado, let’s begin part #2 of the CSV series.
Great, we have a simplified append of CSV files in Power BI, but what we can do in Excel?
While we patiently wait to have the same functionality in Excel, in today’s post I will show you how you can create relatively simple query functions to resolve our challenge and combine multiple CSV files from a folder while extracting essential information from their associated filenames. But first, let’s review three possible scenarios, and how to combine CSV files from a folder on each.
The Basic Scenario – No contextual data in filenames
If you have a folder with many CSV files that share the exact format, you can could append them all into a single table in Excel file.
Here is an example of such CSV files. Each file contains different Star Wars characters, and we want to append them all. Instead of massive repetitions of copy and paste, or a manual import of each file, here is the Power Query’s game changing way to do it.
If you have Excel 2016 In Data tab, click New Query, select From File and then click From Folder. If you are on Excel 2010 or 2013, install Power Query Add-In (download it here) and in the Power Query tab, click From File –> From Folder.
Click Browse, and select the folder that contains the CSV files. Click OK in Browser For Folder, then click OK in Folder.
If you see this preview window, click Edit.
The Query Editor window will appear. This is the main screen to apply transformations on your data. When we’re done, clicking Refresh All in the Data tab (of the ribbon) will automatically load the data from any new and/or modified files and transform it to our desired format.
The next step is where the real magic begins.
Magic – Squishing All the Files Together into One Table
(Note from Rob: yes, we’ve covered parts of this on the blog before, BUT this is a lesson that bears repeating, shouting from the rooftops, and seeing from many perspectives – PLUS it sets up the second half of this post, so this is all very valuable.)
Unfortunately, the portal to the magic land is well hidden. On the left side of the header of the first column you will notice a small button with two arrows. Click on it.
The result: All the CSV files are appended to a single table. We are still inside the Query Editor and can see a preview of the transformation. But there is still some work for us. Notice that each file contains the header names, so the appended result has all the headers as rows in the table.
In Transform tab, click Use First Row As Headers.
The last step promoted the headers of the first CSV files as the headers of the unified appended table.
Removing the Header Rows from the Other Files
Above, we “promoted” the header row from the first file, so that one is taken care of, but we still need to remove the header rows from the other CSV files. To do it, we can apply a simple filter on the first column.
Click the filter button in the header of the first column (In my case column Name), and uncheck the header name in the filter pane (In my case – Name, as highlighted below). When you click OK in the filter pane, the header rows will be filtered out.
Note: In some unexpected cases, unchecking a value in the filter pane creates a dangerous logic that will yield unexpected results. Make sure you activate the Formula Bar of the Query Editor, and confirm that the formula explicitly excludes the value that you unchecked (Keeping the Formula Bar invisible by default, is one of the most common mistakes you can do in Power Query. Read more here – but short version is that you WANT to see something like ‘[Name] <> “Name” ‘ – if you see ‘[Name] = “R2-D2” or [Name]=”R5-D4” or… ‘ then you probably want to change it to the single “<>” version).
In case your CSV files overlap and you think you have duplicate rows, you can easily remove the duplicates by selecting the columns, whose combined values are unique (In my case it’s column Name). After you select the column/s, right click on one of the headers, and select Remove Duplicates.
We are done. We can now click Close & Load in Home tab to close the Query Editor to load the append data into Excel.
And here are the results. Of course, we’d use “Load To” in order to get this into Power Pivot’s data model in most serious cases, but this time I’ll just drop it into Excel for simplicity:
Second Scenario – The contextual information is in first row of each CSV file!
(Note from Rob: YEAH, this is what I’m talking about! Even I learned some good stuff while reviewing this. Glad you stuck around? )
What if instead we want to append CSV files that share the same format, but each file has a unique context that is crucial our analysis? For example, imagine we have sales data from three regions of our company: London, New-York and Paris.
How can we keep the region after we append the data from the three CSV files?
In this section, we will learn how to append the data from all the regions, when the region name is given in the first cell of each file.
After we select our folder, we should click our magic button again:
In the next steps, we will now transform the tables, and move the regions from the first row of each file to a new column, as illustrated here:
In Add Column tab, click Conditional Column.
In Add Conditional Column dialog, Select Column2 as Column Name, equals as Operator, and “” as Value. Then, change the type of Output to Column, and set Column1 as Output. When you’re done, click OK.
Bummer: We Have to Edit the Formula Just a Touch
Note: The window above doesn’t support blank values as an input – YUCK. So we will need to fix the formula. If this is your first time using the Query Editor, and you don’t see the formula bar, go to View tab and check the Formula Bar checkbox.
In the formula bar, replace the long sequence of double quotes to two double quotes “”.
Here is the original formula (before the change):
Here is the modified formula:
Filling it Down (Yes, Filling, Not Filing)
In the next step, we will fill down the region values, so each row will include its corresponding region.
Select column Custom and right click its header. Select Fill and then select Down.
Click on the filter icon of Column2 and select Remove Empty.
In Transform tab, click Use First Row As Headers.
Now we can remove all the header rows. Click the filter icon on the first column (In my case – Date), scroll down till you see the column name as a value (In my case – Date) and uncheck it, then click OK.
We can rename the last column to City or Region.
In the last two steps we can change the column types of Sales and Date, by clicking the small ABC icon in the headers and select Decimal Number for Sales and Date for Date.
That’s it. We can now close the Query Editor. In Home tab click Close & Load.
We were able to append all the sales data and keep the region/city information. You will find this method useful in many scenarios where your CSV files contain meta-data before the actual tables. Copying the data to a new custom column, and filling it down will do the trick.
Third Scenario – The context is in the filename
(This is the part that is just like my previous post, but we’re doing it in Excel rather than Power BI)
So what should we do if our meta-data (like region names) isn’t available in the CSV itself, but in the filename. In our example, all the CSV files contain the sales data, and the region/city name is in the filename, as shown here:
Till the improved Combine Binaries feature in Power BI is released in Excel, we cannot click the Combine Binaries button in the header of column Content to append the data without losing the crucial information in the filenames.
Here is the solution:
After you import the folder, follow the steps below.
Select the first two columns Content and Name. Right click on one of the headers and select Remove Other Columns.
In Home tab, click New Source, then select Other Sources and select Blank Query.
Rename the new query to LoadCSV.
Click Advanced Editor.
Paste the following code in the Advanced Editor and click Done.
(content as binary, filename) =>
#”Imported CSV” = Csv.Document(content,[Encoding=1252, QuoteStyle=QuoteStyle.None]),
#”Promoted Headers” = Table.PromoteHeaders(#”Imported CSV”),
#”Added Custom” = Table.AddColumn(#”Promoted Headers”, “filename”, each filename)
Note: Keep the code above. You will be able to reuse it whenever you need to append CSV files and their filenames from a folder.
In Add Column tab, click Invoke Custom Function.
In Invoke Custom Function dialog, select LoadCSV as Function query.
Select column Content as content, select Column Name as the type of filename (optional).
Select Name as filename (optional), and click OK.
We can now remove the first two columns, by selecting the third column, right clicking on its header and selecting Remove Other Columns.
There is another magic button that we can use in the header of column LoadCSV. This button will expand all the tables we extracted from the CSV files, and will transform them into a single table – with the region/city context intact. Click on the icon (highlighted below), uncheck Use original column name as prefix, and click OK.
We can now manipulate the filename values to represent the cities by removing the extension from the filename).
To remove the .csv extension and keep the city names, select column filename and right click on its header. Then, select Replace Values..
In Replace Values dialog, set .csv as Value To Find, and click OK. (Note that we didn’t provide any text to replace with, as we want to delete this extension).
We can now rename the column filename to City, and change the types of Date and Sales as we did in the previous section. Finally, clicking Close & Load in Home tab will provide the expected results.
From now on, any additional CSV files that you’ll add in the folder will be automatically appended when you click Refresh All in Data tab.
Following this blog post, I hope that you’ll never again manually append multiple CSV files into your worksheet. The Power Query way is relatively easy, and it allows you to refresh the data when new CSV files come in.
Would you like to improve your Power Query skills? Follow my new series on 10 common mistakes you do in Power BI and Power Query, and and how to avoid the Pitfalls here.