Extracting Specific Files from Zip Archives with a Python Script


In this blog post, I'll show you how to create a Python script that extracts files from zip archives to a specific path on your hard drive. Google Takeout lets users export all of their Google Photos as any number of large zip files. After downloading them, you still have to extract the contents to your own file system. This script helps by iterating over each zip and performing the extraction.

- June 23, 2024


When managing large amounts of data, especially backups from services like Google Photos, you may find yourself dealing with numerous zip files. Manually extracting specific files from these archives can be tedious. Fortunately, Python offers a powerful way to automate this process.

In this blog post, I'll show you how to create a Python script that extracts files from zip archives to a specific path on your hard drive. You can specify a particular path within each zip file to extract from, which makes the script well suited to organizing photos or other data from Google backups. The script below iterates over every zip file in a source directory and extracts the same path from each one into a single target directory. In our case we had 17 zip files of 10 GB each and needed to iterate over all of them to extract the photos.


Prerequisites

Before we dive into the script, ensure you have Python installed on your machine. You can download it from python.org. You'll also need the zipfile and os modules, which are part of Python's standard library, so no additional installations are required.

The Script

Here's the Python script to extract files from zip archives:

import os
import zipfile

def extract_specific_path_from_zip(source_dir, dest_dir, specific_path):
    # Ensure destination directory exists
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)
    
    # Loop through all files in the source directory
    for item in os.listdir(source_dir):
        # Construct full file path
        file_path = os.path.join(source_dir, item)
        
        # Check if the file is a ZIP file
        if zipfile.is_zipfile(file_path):
            # Open the ZIP file
            with zipfile.ZipFile(file_path, 'r') as zip_ref:
                # Get the list of all archived file names from the zip
                zip_file_names = zip_ref.namelist()
                
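                # Filter the files under the specific path, skipping directory entries (names ending in "/")
                prefix = specific_path.rstrip("/") + "/"
                files_to_extract = [f for f in zip_file_names if f.startswith(prefix) and not f.endswith("/")]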
                
                # Extract the filtered files preserving the directory structure below the specific path
                for file in files_to_extract:
                    try:
                        # Determine the relative path within the specific path
                        relative_path = os.path.relpath(file, specific_path)
                        
                        # Determine the output file path
                        output_file_path = os.path.join(dest_dir, relative_path)
                        
                        # Normalize the output file path (collapses redundant separators and uses the platform's separator style)
                        output_file_path = os.path.normpath(output_file_path)
                        
                        # Ensure the directory exists
                        os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
                        
                        # Stream the file data to the destination in 1 MiB chunks (avoids loading large files into memory)
                        with zip_ref.open(file) as source_file:
                            with open(output_file_path, 'wb') as output_file:
                                for chunk in iter(lambda: source_file.read(1024 * 1024), b""):
                                    output_file.write(chunk)
                        
                        print(f"Extracted: {file} from {file_path} to {output_file_path}")
                    except Exception as e:
                        print(f"Failed to extract {file} from {file_path}. Error: {e}")

if __name__ == "__main__":
    source_directory = "C:/Temp/May23_BrittmarieGoogleBackup"  # Replace with the path to the source directory
    destination_directory = "G:/Home_PC/Pictures/2023-06-23"  # Replace with the path to the destination directory
    specific_path_in_zip = "Takeout/Google Photos"  # Replace with the specific path to extract from each ZIP file

    extract_specific_path_from_zip(source_directory, destination_directory, specific_path_in_zip)

How It Works

  1. Import Modules: The script begins by importing the zipfile and os modules.
  2. Function Definition: The extract_specific_path_from_zip function takes three parameters:
    • source_dir: The directory containing the zip files you want to extract files from.
    • dest_dir: The directory where you want the extracted files to be saved.
    • specific_path: The specific path within each zip file from which you want to extract files.
  3. Ensure Destination Directory Exists: The script checks if the destination directory exists and creates it if necessary.
  4. Loop Through Source Directory: It loops through all files in the source directory, checking each one to see if it's a zip file.
  5. Open Zip File: For each zip file, the script opens it and lists all the files within the zip archive.
  6. Filter Files: The script filters the files based on the specified path within the zip file.
  7. Extract Files: For each filtered file, the script constructs the target file path, normalizes it with os.path.normpath, creates any missing directories, and writes the file to the target path. Failures are caught and reported per file, so one bad entry does not stop the run.
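
Before running the real extraction over many large archives, it can help to preview what steps 6 and 7 would do. The following is a minimal sketch, not part of the script above: plan_extraction is a hypothetical helper name, and the rule of skipping names that would resolve outside the destination (for example names containing "..") is an added assumption, not something the original script does. It only reads the archive's file list and writes nothing to disk.

```python
import os
import posixpath
import zipfile


def plan_extraction(zip_path, specific_path, dest_dir):
    """Return (archive_name, output_path) pairs without writing anything."""
    # Treat specific_path as a folder so "Takeout/Google Photos" does not
    # also match a sibling such as "Takeout/Google Photos Extra"
    prefix = specific_path.rstrip("/") + "/"
    plan = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for name in zip_ref.namelist():
            # Keep only file entries below the prefix; names ending in "/" are directories
            if not name.startswith(prefix) or name.endswith("/"):
                continue
            # Zip entry names use "/" separators, so normalize with posixpath
            relative = posixpath.normpath(name[len(prefix):])
            # Assumption: skip entries that would land outside dest_dir
            if relative == ".." or relative.startswith("../") or posixpath.isabs(relative):
                continue
            plan.append((name, os.path.join(dest_dir, *relative.split("/"))))
    return plan
```

Printing the returned pairs for one archive is a quick way to confirm that the path inside the zip is spelled correctly before committing to a long run.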

Usage

To use this script, replace the placeholder values for source_directory, destination_directory, and specific_path_in_zip with your specific paths. For example, if you want to extract all photos from a Google Photos backup, you might have:

if __name__ == "__main__":
    source_directory = "C:/Temp/May23_BrittmarieGoogleBackup"  # Replace with the path to the source directory
    destination_directory = "G:/Home_PC/Pictures/2023-06-23"  # Replace with the path to the destination directory
    specific_path_in_zip = "Takeout/Google Photos"  # Replace with the specific path to extract from each ZIP file

    extract_specific_path_from_zip(source_directory, destination_directory, specific_path_in_zip)

Save the script as extract_photos.py and run it using:

python extract_photos.py

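The script extracts every file under the specified path in each zip archive and saves it to your target directory, preserving the directory structure below that path. If the same relative path appears in more than one archive, the copy from the archive processed later overwrites the earlier one.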
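
If you would rather not edit the script every time the paths change, one option is to read them from the command line. This is a sketch under assumptions: build_parser and the --path-in-zip option are hypothetical names chosen for illustration, and the default value simply mirrors the example path used in this post.

```python
import argparse


def build_parser():
    """Build a command-line parser for the three values the script needs."""
    parser = argparse.ArgumentParser(
        description="Extract one path from every zip archive in a directory."
    )
    parser.add_argument("source_dir", help="directory containing the zip files")
    parser.add_argument("dest_dir", help="directory to write the extracted files to")
    parser.add_argument(
        "--path-in-zip",
        default="Takeout/Google Photos",
        help="path inside each archive to extract (default: %(default)s)",
    )
    return parser


# Example wiring inside extract_photos.py (shown as comments so this
# sketch stays independent of the script above):
#   args = build_parser().parse_args()
#   extract_specific_path_from_zip(args.source_dir, args.dest_dir, args.path_in_zip)
```

With that wiring in place, the script could be run as python extract_photos.py followed by the source and destination directories, with --path-in-zip supplied only when the default does not fit.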

Conclusion

This Python script is a handy tool for automating the extraction of specific files from zip archives. Whether you're organizing photos from Google or managing backups, it can save you time and effort. Happy coding!