Request for new functionality
Description
Summary:
Hadoop's default behaviour is to automatically decompress files with the .gz extension (see here). When gzip encoding is enabled (fs.gs.inputstream.support.gzip.encoding.enable=true), reading gzip-encoded files from GCS causes both the GCS connector and the Hadoop filesystem layer to attempt decompression, leading to errors. Since disabling the gzip decompression behaviour in Hadoop is not possible without changing the hadoop-core library, it would be helpful if the GCS connector could automatically skip decompression when the file extension is .gz, or at least provide a configuration property for disabling the automatic decompression.

Github Issue: https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/1060
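The failure mode can be sketched locally with plain gzip: decompressing a stream once succeeds, but a second decompression pass (what happens when both the connector and Hadoop try to decode the same object) fails because the output is no longer a gzip stream:

```shell
# Compress once, then decompress twice. The second pass mirrors the
# double decompression performed by the connector plus Hadoop FS.
tmp="$(mktemp)"
printf 'hello world\n' | gzip > "$tmp"

gzip -dc "$tmp"                  # first decompression succeeds, prints the text
gzip -dc "$tmp" | gzip -dc \
  || echo "second decompression failed: input is not a gzip stream"

rm -f "$tmp"
```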
Reproduction Steps:
1. Upload a gzip-compressed object with a .gz extension to GCS and set the Content-Encoding: gzip metadata field on it.
2. Enable gzip encoding support in the Hadoop core configuration: fs.gs.inputstream.support.gzip.encoding.enable=true
3. Read the object through the GCS connector; both the connector and Hadoop attempt to decompress it, and the read fails.
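As a concrete sketch of the steps above (the bucket and object names are hypothetical placeholders, not taken from the report):

```shell
# 1. Compress a file and upload it with the Content-Encoding: gzip header.
#    gs://my-bucket and data.csv are placeholder names.
gzip data.csv
gsutil -h "Content-Encoding:gzip" cp data.csv.gz gs://my-bucket/data.csv.gz

# 2+3. Read it back with gzip encoding support enabled; both the connector
#      and Hadoop FS then attempt decompression, and the read fails.
hadoop fs -D fs.gs.inputstream.support.gzip.encoding.enable=true \
  -cat gs://my-bucket/data.csv.gz
```

These are configuration/CLI steps that require a real GCS bucket and a cluster with the GCS connector installed, so they are shown here only as a sketch.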
Mitigation:
Either unset the Content-Encoding: gzip metadata field on the GCS object (so the connector does not decompress it) or remove the .gz extension from the object name (so Hadoop does not decompress it).
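Both mitigations can be applied with gsutil; the bucket and object names below are placeholders:

```shell
# Option 1: clear the Content-Encoding metadata so the connector serves
# the raw bytes (setting an empty header value removes the field).
gsutil setmeta -h "Content-Encoding:" gs://my-bucket/data.csv.gz

# Option 2: rename the object so it no longer ends in .gz, which stops
# Hadoop's extension-based decompression (the connector then decompresses
# the object exactly once).
gsutil mv gs://my-bucket/data.csv.gz gs://my-bucket/data.csv
```

These commands require access to a real GCS bucket and are shown as a configuration sketch only.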