Unzipping and Shuffling GBs of Data Using Azure Functions

Consider this situation: you have a zip file stored in an Azure Blob Storage container (or any other location for that matter). This isn’t just any zip file; it’s large, containing gigabytes of data. It could be big data sets for your machine learning projects, log files, media files, or backups. The specific content isn’t the focus - the size is.

The task? We need to unzip this massive file and relocate its contents to a different Azure Blob Storage container. This might seem daunting, especially considering the size of the file and the number of files that might be housed within it.

Why do we need to do this? The use cases are numerous. Handling large data sets, moving data for analysis, making backups more accessible - these are just a few examples. The key here is that we’re looking for a scalable and reliable solution to handle this task efficiently.

Azure Data Factory is arguably a better fit for this sort of task, but in this blog post we will demonstrate how to set up this process using Azure Functions. Specifically, we will try to achieve this within the constraints of the Consumption plan tier, where the maximum memory is capped at 1.5GB, with Azure CLI and PowerShell playing supporting roles in our setup.

Setting Up Our Azure Environment

Before we dive into scripting and code, we need to set the stage - that means setting up our Azure environment. We’re going to create a storage account with two containers, one for our Zipped files and the other for Unzipped files.

To create this setup, we’ll be using the Azure CLI. Why? Because it’s efficient and lets us script out the whole process if we need to do it again in the future.

  1. Install Azure CLI: If you haven’t already installed Azure CLI on your local machine, you can get it from here.

  2. Log in to Azure: Open your terminal and type the following command to log in to your Azure account. You’ll be prompted to enter your credentials.

    az login    
  3. Create a Resource Group: We’ll need a Resource Group to keep our resources organized. We’ll call this rg-function-app-unzip-test and create it in the eastus location (you can of course choose whichever region you like).

    az group create --name rg-function-app-unzip-test --location eastus    
  4. Create a Storage Account: Next, we’ll create a storage account within our Resource Group. We’ll name it unziptststorageacct.

    az storage account create --name unziptststorageacct --resource-group rg-function-app-unzip-test --location eastus --sku Standard_LRS    
  5. Create the Blob Containers: Finally, we’ll create our two containers, ‘zipped’ and ‘unzipped’, in the unziptststorageacct storage account.

    az storage container create --name zipped --account-name unziptststorageacct
    az storage container create --name unzipped --account-name unziptststorageacct

    Now your Azure environment is ready. We’ve got our storage account unziptststorageacct and two containers, ‘zipped’ and ‘unzipped’, set up for our operations. The next step is to create our zip file.

Concocting Our Data With PowerShell

Our next task is to create a large zip file filled with multiple 100MB files, each brimming with random data. In a real-world scenario you would already have these large files, but since we are simulating, let’s use PowerShell to create them.

If you already have an existing zip file with large-ish files for testing, you can skip this step and use that file instead.

# Set the number of files we want to create
$fileCount = 10

# The path where you want to create the TestFiles directory
$directory = "C:\_Temp\TestFiles"

# Create a new directory for our files if it doesn't already exist
if(-not (Test-Path -Path $directory)){
    New-Item -ItemType Directory -Path $directory
}

# Loop through and create our files
for ($i=1; $i -le $fileCount; $i++){
    # Generate a 100MB file filled with random bytes and save it in our new directory
    $fileContent = New-Object byte[] 104857600
    (New-Object Random).NextBytes($fileContent)
    [System.IO.File]::WriteAllBytes("$directory\File$i.txt", $fileContent)
}

# Now that we have all our files, let's zip them up
Compress-Archive -Path "$directory\*" -DestinationPath "$directory.zip"

This simple script creates 10 files, each 100MB in size, and then zips them up into a single archive. The resulting zip file should be around 1GB in size.

In case you are wondering how we end up with a 1GB+ zip file after compressing 1GB worth of data: we are generating files filled with random bytes. Compression algorithms work by finding and eliminating redundancy in the data. Since random data has no redundancy, it cannot be compressed. In fact, trying to compress random data can even result in output that is slightly larger than the input, due to the overhead of the compression format.
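If you want to see this effect for yourself, here is a small, self-contained C# console sketch (not part of the blob setup above; the 10MB sizes and class name are arbitrary choices for illustration) that gzip-compresses random bytes and highly repetitive bytes and compares the results:

using System;
using System.IO;
using System.IO.Compression;

class CompressionDemo
{
    // Returns the gzip-compressed size of the given data
    static long CompressedLength(byte[] data)
    {
        using var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionLevel.Optimal, leaveOpen: true))
        {
            gzip.Write(data, 0, data.Length);
        }
        return output.Length;
    }

    static void Main()
    {
        var random = new byte[10 * 1024 * 1024];      // 10MB of random bytes
        new Random().NextBytes(random);
        var repetitive = new byte[10 * 1024 * 1024];  // 10MB of zero bytes

        // Random data stays roughly the same size (or grows slightly);
        // highly redundant data shrinks to almost nothing.
        Console.WriteLine($"Random:     {CompressedLength(random):N0} bytes compressed");
        Console.WriteLine($"Repetitive: {CompressedLength(repetitive):N0} bytes compressed");
    }
}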

We’ll use this file to test our Azure Function.

Azure Function To Unzip

We’re going to create a Function that magically springs into action the moment a blob (our zipped file) lands in the ‘zipped’ container. This function will stream the data, unzip the files, and store them neatly as individual files in the ‘unzipped’ container.

Before we begin, ensure that you’ve installed the Azure Functions Core Tools locally. You’d also need the Azure Functions Extension for Visual Studio Code.

First, let’s use the CLI to create our Consumption plan function app. We’ll call it unzipfunctionapp123 and use the unziptststorageacct storage account we created earlier. We’ll also specify the runtime as dotnet and the Functions version as 4. We are using the Consumption plan to demonstrate that this solution can work within its constraints, where the maximum memory is capped at 1.5GB.

az functionapp create --resource-group rg-function-app-unzip-test --consumption-plan-location eastus --runtime dotnet --functions-version 4 --name unzipfunctionapp123 --storage-account unziptststorageacct

You might need to change the function app name in the example above from ‘unzipfunctionapp123’, as it could already be taken; Azure function app names must be globally unique.
When you create an Azure function app, the name you specify becomes part of its URL: <name>.azurewebsites.net.

If the function app name is already taken, you will get an error like ‘Website with given name unzipfunctionapp already exists.’ when you run the CLI command above.

Now that we have the Consumption plan function app infrastructure in place, let’s look at the code that does the actual work of unzipping and uploading.
There are two code samples below, and both are quite similar in their basic approach: they handle the data in a streaming manner, which allows them to deal with large files without consuming a lot of memory.
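Here is a minimal sketch of the first approach: a blob-triggered C# function using ZipArchive. The function name, the blob trigger on the ‘zipped’ container, and the use of the classic WindowsAzure.Storage package (which provides the UploadFromStreamAsync method discussed below) are illustrative assumptions rather than the exact code from the original samples.

using System;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class UnzipWithZipArchive
{
    [FunctionName("UnzipWithZipArchive")]
    public static async Task Run(
        [BlobTrigger("zipped/{name}", Connection = "AzureWebJobsStorage")] Stream zipBlob,
        string name,
        ILogger log)
    {
        // Destination container for the extracted files
        var account = CloudStorageAccount.Parse(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"));
        var container = account.CreateCloudBlobClient().GetContainerReference("unzipped");

        using (var archive = new ZipArchive(zipBlob, ZipArchiveMode.Read))
        {
            foreach (var entry in archive.Entries)
            {
                if (string.IsNullOrEmpty(entry.Name)) continue; // skip folder entries

                var outputBlob = container.GetBlockBlobReference(entry.FullName);
                using (var entryStream = entry.Open())
                {
                    // Stream the unzipped entry straight to the destination blob
                    await outputBlob.UploadFromStreamAsync(entryStream);
                }
            }
        }

        log.LogInformation($"Unzipped {name} into the 'unzipped' container.");
    }
}

And here is a comparable sketch of the second approach, using ZipInputStream and StreamUtils.Copy from SharpZipLib with an explicit 4KB buffer. Again, the function name and the OpenWriteAsync-based upload to the destination blob are illustrative assumptions.

using System;
using System.IO;
using System.Threading.Tasks;
using ICSharpCode.SharpZipLib.Core;
using ICSharpCode.SharpZipLib.Zip;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class UnzipWithSharpZipLib
{
    [FunctionName("UnzipWithSharpZipLib")]
    public static async Task Run(
        [BlobTrigger("zipped/{name}", Connection = "AzureWebJobsStorage")] Stream zipBlob,
        string name,
        ILogger log)
    {
        var account = CloudStorageAccount.Parse(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"));
        var container = account.CreateCloudBlobClient().GetContainerReference("unzipped");

        var buffer = new byte[4096]; // explicit copy buffer

        using (var zipInputStream = new ZipInputStream(zipBlob))
        {
            ZipEntry entry;
            while ((entry = zipInputStream.GetNextEntry()) != null)
            {
                if (!entry.IsFile) continue; // skip folder entries

                var outputBlob = container.GetBlockBlobReference(entry.Name);
                using (var blobStream = await outputBlob.OpenWriteAsync())
                {
                    // Copy the current entry from the zip stream to the blob in 4KB chunks
                    StreamUtils.Copy(zipInputStream, blobStream, buffer);
                }
            }
        }

        log.LogInformation($"Unzipped {name} into the 'unzipped' container.");
    }
}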

However, there are some differences in the details of how they handle the streaming, which may have implications for their performance and resource usage:

  • The first code sample uses the built-in ZipArchive class from System.IO.Compression, which provides a high-level, user-friendly interface for dealing with zip files. The second code sample uses the ZipInputStream class from the SharpZipLib library, which provides a lower-level, more flexible interface.
  • In the first code sample, the ZipArchive automatically takes care of reading from the blob stream and unzipping the data. It provides an Open method for each entry in the zip file, which returns a stream that you can read the unzipped data from. In the second code sample, you manually read from the ZipInputStream and write to the blob stream using the StreamUtils.Copy method.
  • The second code sample manually handles the buffer size with new byte[4096] for copying data from the zip input stream to the blob output stream. In contrast, the first code sample relies on the default buffer size provided by the UploadFromStreamAsync method.

Memory-wise both are similar (i.e., neither downloads the entire zip file into memory), but the first script takes around 20 minutes to process a 1GB zip file (with 10 x 100MB files), whereas the second script takes about 10 minutes for the same 1GB zip file. This mainly comes down to the custom buffer size and the optimizations in the SharpZipLib library.

The first script has the benefit of not importing any third-party library, but it cannot complete within an Azure Consumption plan: at the time of this writing, the Consumption plan has a maximum runtime of 10 minutes.
The second script can potentially run on a Consumption plan, but at the cost of importing a third-party library.

Author: Ricky Gummadi
Posted on 2023-05-19 · Updated on 2023-05-22
