I’m facing an issue when downloading a CSV file from S3, processing it, and reading it into a Pandas DataFrame. Here’s the situation:
- I’m downloading a file from an S3 bucket as a byte stream, saving it locally, and then reading it with pandas.
- The file downloads and saves correctly (as far as I can tell):
  - The file size matches what’s reported in S3.
  - The file is saved locally without errors.
- However, when I load the file into pandas, it shows 0 rows despite having columns.
What I’ve Tried:
Manually Downloading and Reading the File: It works perfectly when I manually download the file from S3 using the AWS console and load it into pandas; the DataFrame even has over a million rows.
Encoding Checks: I used chardet to detect the encoding; it usually reports utf-8, ascii, or something similar for the byte stream. I also tried setting the encoding manually, but that didn't help either.
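For reference, the check looked roughly like this (a sketch only; the sample size and variable names follow the example code further down, not my exact implementation):

    import chardet
    import pandas as pd

    # Detect the encoding from a sample of the raw bytes (sample size is illustrative)
    sample = file_content[:100_000]
    detected = chardet.detect(sample)
    print(detected)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

    # Pass the detected encoding explicitly to pandas
    df = pd.read_csv(local_filename, encoding=detected["encoding"], low_memory=False)
    print(df.shape)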
Retrying Full File Download: Since I was initially downloading the file from S3 in bytes into local memory, I suspected something was wrong with how the file was being reconstructed from those bytes. As a fallback mechanism, I added a check that re-downloads the whole file from S3 whenever the byte approach yields 0 rows, but that was to no avail either.
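The fallback logic is roughly the following (a simplified sketch; download_in_chunks stands in for my chunked download code and the other names are illustrative):

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")

    def load_with_fallback(bucket, key, local_path):
        # First attempt: reconstruct the file from the chunked byte download
        download_in_chunks(bucket, key, local_path)  # placeholder for the chunked logic
        df = pd.read_csv(local_path, low_memory=False)

        if df.shape[0] == 0:
            # Fallback: pull the whole object in a single call and overwrite the local file
            print("0 rows detected, re-downloading the full object")
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            with open(local_path, "wb") as f:
                f.write(body)
            df = pd.read_csv(local_path, low_memory=False)

        return df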
Debugging the Stream: I previewed the first 1,000 bytes of the downloaded content and they appear to be all binary zeros (\x00), which seems wrong. Note: when the same file is downloaded manually and opened with pandas, it works.
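The preview itself was essentially just this (a sketch; file_content is the reassembled byte string):

    # Preview the first 1,000 bytes of the reconstructed content
    preview = file_content[:1000]
    print(preview)

    # For the problematic files this shows nothing but NUL bytes
    print(all(b == 0 for b in preview))  # prints True for the broken downloads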
Key Points
- The manually downloaded file works perfectly with pandas, so the file in S3 isn’t corrupted.
- My current implementation downloads the file in chunks, reconstructs it from the stream, and then saves it locally (a simplified sketch follows this list).
- Despite following the same logic manually (downloading and saving), the programmatically downloaded file doesn’t load rows in pandas.
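The chunked download is essentially a series of ranged GET requests stitched back together. A simplified sketch of the idea (the chunk size matches the 100 MiB ranges in the log further down; the progress-log bookkeeping is omitted and the names are illustrative):

    import boto3

    s3 = boto3.client("s3")
    CHUNK_SIZE = 100 * 1024 * 1024  # 100 MiB per ranged request

    def download_in_chunks(bucket, key, local_path):
        total_size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
        buffer = bytearray()

        for start in range(0, total_size, CHUNK_SIZE):
            end = min(start + CHUNK_SIZE, total_size) - 1
            print(f"Downloading missing chunk: bytes={start}-{end}")
            resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
            buffer.extend(resp["Body"].read())

        print(f"Buffer size after downloading missing chunks: {len(buffer)} bytes")
        with open(local_path, "wb") as f:
            f.write(buffer)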
Example
    import boto3
    import pandas as pd

    s3 = boto3.client('s3', region_name='', aws_access_key_id='', aws_secret_access_key='')
    S3_BUCKET_NAME = ""
    S3_FILE_KEY = ""

    def download_and_read_file(s3_bucket, s3_key):
        try:
            print(f"Downloading file: {s3_key}")
            response = s3.get_object(Bucket=s3_bucket, Key=s3_key)
            file_content = response['Body'].read()

            # saving the byte stream locally
            local_filename = "downloaded_file.csv"
            with open(local_filename, "wb") as f:
                f.write(file_content)
            print(f"File saved locally as {local_filename}")

            # loading the file into pandas
            df = pd.read_csv(local_filename, low_memory=False)
            print(f"Total rows in DataFrame: {df.shape[0]}")
            print(df.head())
        except Exception as e:
            print(f"Error: {e}")

    # calling the function
    download_and_read_file(S3_BUCKET_NAME, S3_FILE_KEY)
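For completeness, the file-size check mentioned at the top is essentially this (a sketch; response and local_filename are the variables from the example above):

    import os

    # Compare the size S3 reports for the object with what ended up on disk
    reported = response["ContentLength"]          # from the get_object response
    local_size = os.path.getsize(local_filename)
    print(reported, local_size, reported == local_size)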
Absolutely any insight will be appreciated. Below is an output example showing that my byte approach does work for some CSV files:

    Downloading missing chunk: bytes=0-104857599
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=104857600-209715199
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=209715200-314572799
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=314572800-419430399
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=419430400-524287999
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=524288000-549793036
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Buffer size after downloading missing chunks: 549793037 bytes
    Download complete
    File saved
    Successfully loaded file into pandas. Total rows: 1287454

But for some CSV files, this same approach gives me 0 rows. I then downloaded one of those files manually from S3 and ran:

    data = pd.read_csv("path", low_memory=False)
    total_rows = data.shape[0]
    print(total_rows)

That works like a charm. However, the code below gives me 0 rows:

    # re-attempting to load the file with pandas
    df = pd.read_csv(local_filepath, low_memory=False)
    total_rows = df.shape[0]
    print(f"📊 Successfully loaded full file into pandas. Total rows: {total_rows}")
    print(df.head())