Channel: Active questions tagged row - Stack Overflow

CSV File from S3 Returns 0 Rows When Loaded with Pandas Despite Manual Download Working


I’m facing an issue when downloading a CSV file from S3, processing it, and reading it into a Pandas DataFrame. Here’s the situation:

  1. I’m downloading a file from an S3 bucket as a byte stream, saving it locally, and then reading it with pandas.
  2. The file downloads and saves correctly, as far as I can tell: the file size matches what’s reported in S3, and the file is saved locally without errors.
  3. However, when I load the file into pandas, it shows 0 rows despite having columns.

What I’ve Tried:

  1. Manually Downloading and Reading the File: It works perfectly when I manually download the file from S3 using the AWS console and load it into Pandas. It even had over a million rows.

  2. Encoding Checks: Used chardet to detect the encoding. It usually detects utf-8, ascii, or a similar encoding for the byte stream. I also tried manually setting the encoding, but that didn't work either.

  3. Retrying a Full File Download: Since I was initially downloading the file in byte ranges from S3 into local memory, I thought something might be wrong with how the file was reconstructed from those bytes. As a fallback mechanism, I added a check that downloads the whole file from S3 in a single request whenever the byte approach yields a row count of 0, but this was to no avail either.

  4. Debugging the Stream: Previewing the first 1,000 bytes of the file content shows they are all binary zeros (\x00), which seems wrong. Note: when the same file is downloaded manually and opened with pandas, it works.
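For reference, the checks from points 2 and 4 look roughly like this on my end (the filename is a placeholder for the locally saved copy):

```python
# Sketch of the byte-preview check: read the first bytes of the saved file
# and test whether they are nothing but \x00, which is what I see for the
# files that load with 0 rows.

def preview_bytes(path, n=1000):
    """Return the first n raw bytes of the file."""
    with open(path, "rb") as f:
        return f.read(n)

def looks_like_nul_padding(chunk):
    """True if the chunk is non-empty and consists entirely of zero bytes."""
    return len(chunk) > 0 and chunk.count(b"\x00") == len(chunk)
```

On a healthy CSV the preview starts with the header row; on the broken files `looks_like_nul_padding` returns True.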

Key Points

  1. The manually downloaded file works perfectly with pandas, so the file in S3 isn’t corrupted.
  2. My current implementation involves downloading the file in chunks, reconstructing it from a stream, and then saving it locally.
  3. Despite following the same logic manually (downloading and saving), the programmatically downloaded file doesn’t load rows in pandas.
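To rule out a truncated write, I can add a sanity check between saving and reading; a minimal sketch, assuming the object size is available from S3 metadata (the `head_object` comment below shows where that number would come from; bucket/key names are placeholders):

```python
import os

def verify_download(local_path, expected_size):
    """Compare the on-disk byte count of the saved file with the size S3
    reports for the object. A mismatch would mean the stream was not fully
    written out before pandas tried to read it."""
    actual = os.path.getsize(local_path)
    return actual == expected_size

# With boto3, expected_size would come from the object's metadata, e.g.:
#   expected_size = s3.head_object(Bucket=S3_BUCKET_NAME,
#                                  Key=S3_FILE_KEY)["ContentLength"]
```

In my case the sizes match, which is exactly why the all-zero content is so confusing.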

Example

    import boto3
    import pandas as pd

    s3 = boto3.client(
        's3',
        region_name='',
        aws_access_key_id='',
        aws_secret_access_key='',
    )
    S3_BUCKET_NAME = ""
    S3_FILE_KEY = ""

    def download_and_read_file(s3_bucket, s3_key):
        try:
            print(f"Downloading file: {s3_key}")
            response = s3.get_object(Bucket=s3_bucket, Key=s3_key)
            file_content = response['Body'].read()

            local_filename = "downloaded_file.csv"
            with open(local_filename, "wb") as f:
                f.write(file_content)
            print(f"File saved locally as {local_filename}")

            # loading the file into pandas
            df = pd.read_csv(local_filename, low_memory=False)
            print(f"Total rows in DataFrame: {df.shape[0]}")
            print(df.head())
        except Exception as e:
            print(f"Error: {e}")

    # calling the function
    download_and_read_file(S3_BUCKET_NAME, S3_FILE_KEY)
Absolutely any insight will be appreciated.

Below is an output example showing that my byte approach does work for some CSV files:

    Downloading missing chunk: bytes=0-104857599
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=104857600-209715199
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=209715200-314572799
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=314572800-419430399
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=419430400-524287999
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Downloading missing chunk: bytes=524288000-549793036
    Updated byte_progress_log.txt locally.
    Uploaded byte_progress_log.txt to S3.
    Buffer size after downloading missing chunks: 549793037 bytes
    Download complete
    File saved
    Successfully loaded file into pandas. Total rows: 1287454

But for some CSV files, this same approach gives me 0 rows. I then tried downloading that same file manually from S3 and did:

    data = pd.read_csv("path", low_memory=False)
    total_rows = data.shape[0]
    print(total_rows)

Now, that works like a charm. However, the code below gives me 0 rows:

    # re-attempting to load the file with pandas
    df = pd.read_csv(local_filepath, low_memory=False)
    total_rows = df.shape[0]
    print(f"📊 Successfully loaded full file into pandas. Total rows: {total_rows}")
    print(df.head())
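For completeness, the chunked reconstruction behind the log above works roughly like this (the chunk size matches the 100 MiB byte ranges in the log; `fetch_range` stands in for a ranged `get_object` call and is a placeholder). My suspicion is that if any chunk were skipped or written at the wrong offset, the gap would remain as zero bytes, which would match the all-\x00 previews I see in the broken files:

```python
import io

CHUNK = 104857600  # 100 MiB, matching the byte ranges in the log above

def assemble_from_ranges(total_size, fetch_range, chunk=CHUNK):
    """Rebuild a file from ranged downloads. fetch_range(start, end) must
    return the bytes for the inclusive range start-end (as an S3
    get_object(..., Range=f"bytes={start}-{end}") call would). Each chunk
    is written at its own byte offset; a skipped or misplaced chunk leaves
    a run of zero bytes in the buffer."""
    buf = io.BytesIO()
    for start in range(0, total_size, chunk):
        end = min(start + chunk, total_size) - 1
        buf.seek(start)  # position the buffer at this chunk's offset
        buf.write(fetch_range(start, end))
    return buf.getvalue()
```

This is only a sketch of my approach, not the exact production code, but it is where I would expect an off-by-one in the ranges or a missed `seek` to produce exactly the symptom described.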

