bladedragon

Some Pitfalls and Reflections on S3A

Recently my work involved an architecture upgrade: migrating the original EMR cluster to a self-built cluster based on open source components. Naturally, some components we had been using also needed to be adapted, including s3. In our self-built cluster, we use the open source Hadoop s3a client (also called the s3a connector; abbreviated as s3a below) to connect to the existing s3 storage. Below I will share some lessons learned while using the s3a client.

Types of s3 Protocols#

First, let's introduce s3. Its full name is Amazon Simple Storage Service (AWS S3). It is an object storage service developed by Amazon, a complete storage product that includes web service APIs. In a broad sense, "s3" also refers specifically to AWS's object storage. Files on the storage are accessed through AWS's own s3 protocol, with paths like: s3://xxx/yyy/zzz

Given AWS's market dominance, s3 storage is widely used across industries. However, the s3 protocol is AWS-proprietary and can only be used inside AWS products. So that users could connect to s3 storage from anywhere and confidently keep their data on s3, Hadoop gained an open source s3 client: the s3n (S3 Native) client. File paths under the s3n protocol look like: s3n://xxx/yyy/zzz

Later, Hadoop phased out the s3n client and adopted the brand-new s3a client. File paths under the s3a protocol look like: s3a://xxx/yyy/zzz

(The s3 client that Hadoop originally shipped also used the s3 protocol, with file paths like s3://, but it was deprecated long ago, so I won't cover it here.)

To summarize the differences between the three file protocols:

  • The s3 protocol is AWS's initial and proprietary protocol, widely used in AWS products such as EMR and EC2. If your company uses the AWS ecosystem, you will mostly access s3 through the s3 protocol.
  • s3n and s3a are open source clients implemented in Hadoop (s3n on top of the JetS3t library, s3a on top of the official AWS SDK). They break free of the limitations of AWS products and let self-built services access s3 storage freely (a minimal connection sketch follows this list). They suit scenarios that use only AWS's s3 storage service but need to reach it from many different products.
  • In terms of performance, there is little difference between the three. Considering that s3 is maintained by AWS itself, its version iteration speed may be faster than that of open source s3n and s3a.
  • In terms of compatibility, s3n and s3a, as derivatives, are theoretically compatible with the s3 protocol. s3n and s3a are also generally compatible with each other (i.e., you can upgrade from s3n to s3a), but they differ in how they create directories (covered in the next section), so care is needed in some scenarios.
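
As a concrete illustration, here is a minimal sketch of pointing the Hadoop FileSystem API at an s3a path from a self-built service. The bucket name, credentials, and endpoint are placeholders; the fs.s3a.* keys come from the Hadoop s3a documentation, and in production you would normally supply credentials through a credential provider rather than in code.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; prefer a credential provider in real deployments.
        conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");
        // Only needed for a non-default region or an s3-compatible store.
        conf.set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com");

        // "bucket" is a placeholder bucket name.
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), conf);
        for (FileStatus st : fs.listStatus(new Path("s3a://bucket/"))) {
            System.out.println(st.getPath());
        }
    }
}
```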

Differences in Directory Handling in s3a#

In fact, many people do not consider s3 a file system at all: it is an object store, and it differs significantly from traditional file systems. The official documentation states:

Amazon S3 is not a filesystem, it is an object store.

One important difference between the two is the interpretation of the directory concept.

  • In traditional Unix-style file systems such as HDFS, the namespace is a tree of directories and files, which means that once a directory is created it is "always present", whether or not it contains any files.
  • In the "file system" that s3 layers over object storage, directories are "virtual", because s3 recognizes directories through prefix matching. For example, if two objects share the path prefix a/b (a/b/file1 and a/b/file2), s3 considers the directory a/b to exist.

The difference between the two in this aspect can lead to various pitfalls in practical use.
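
You can see this prefix-based behavior directly with a plain S3 listing. A minimal sketch with the AWS Java SDK (v1); the bucket name and keys are placeholders:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;

public class VirtualDirsDemo {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Listing with a delimiter: s3 itself has no directories, it just
        // groups keys sharing a prefix into "common prefixes".
        ListObjectsV2Request req = new ListObjectsV2Request()
                .withBucketName("bucket")   // placeholder
                .withPrefix("a/")
                .withDelimiter("/");
        ListObjectsV2Result res = s3.listObjectsV2(req);
        // If a/b/file1 and a/b/file2 exist, "a/b/" appears here even though
        // no object named a/b/ was ever created.
        res.getCommonPrefixes().forEach(System.out::println);
    }
}
```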

Differences in Specific Scenarios#

Let's further divide the scenarios:
Creating Directories

  • HDFS and other directory-tree file systems create a real empty directory, into which files and subdirectories can be added; the directory can be discovered at any time (whether or not it contains anything) via the ls command.
  • If s3 could do this, it could also be said to implement a directory tree, but unfortunately it cannot. Since directories in s3 are identified by prefixes, an empty directory needs a directory marker (DM) object to mark its existence. When a file is created inside the directory, the DM is deleted; conversely, when the directory becomes empty again, the DM is recreated. (A minimal sketch follows this list.)
    • In s3a, the DM for an empty directory is an object named path_name + /. For example, executing mkdir(s3a://bucket/a/b) creates a marker object a/b/.
    • In the older s3n, the DM takes the form path_name_$folder$.
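
Here is a minimal sketch of what that looks like through the Hadoop FileSystem API, assuming a placeholder bucket; the marker object a/b/ described above is created by s3a behind the scenes:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirMarkerDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());
        // On an empty path, s3a PUTs a zero-byte marker object "a/b/" so that
        // later listings can "see" the otherwise empty directory.
        fs.mkdirs(new Path("s3a://bucket/a/b"));
        System.out.println(fs.getFileStatus(new Path("s3a://bucket/a/b")).isDirectory()); // true
    }
}
```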

I believe you can already see the problem. When conventions differ between old and new versions of a product, trouble often follows. For example, if you create a directory with the aws s3 tool but then connect with s3a, you often cannot find that directory: the aws tool does not create the marker object s3a expects, so the directory is invisible under the s3a protocol. Similarly, when a cluster upgrades from an older Hadoop version to a newer one, pay special attention to existing empty directories: upgrading from s3n to s3a changes which DM format is recognized, so the original empty directories may no longer be detected.
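
Related to this, newer Hadoop releases (3.3.1+) let you choose whether s3a keeps or deletes directory markers; the key below is taken from the "Controlling the S3A Directory Marker Behavior" documentation in the references. A minimal sketch; note that "keep" is only safe once every client touching the bucket is marker-aware:

```java
import org.apache.hadoop.conf.Configuration;

public class MarkerPolicy {
    public static Configuration keepMarkers() {
        Configuration conf = new Configuration();
        // "delete" is the backwards-compatible default: markers are removed
        // when files are created underneath them. "keep" avoids the extra
        // DELETE calls (and tombstones in versioned buckets), but pre-3.3
        // clients may misread kept markers as empty directories.
        conf.set("fs.s3a.directory.marker.retention", "keep");
        return conf;
    }
}
```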

Creating Files

  • In an ordinary directory-tree file system such as HDFS, creating a file only requires creating a single entry at the directory's path.
  • In s3, creating a file may additionally trigger a series of DM deletions: s3a bundles the deletion of all parent DMs into a single bulk-delete request (see the sketch after this list).
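
A minimal sketch of that sequence through the FileSystem API (placeholder bucket again); the marker bookkeeping happens inside s3a:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateFileDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());
        fs.mkdirs(new Path("s3a://bucket/a/b"));   // PUTs the marker a/b/
        try (FSDataOutputStream out = fs.create(new Path("s3a://bucket/a/b/file1"))) {
            out.writeBytes("hello");
        }
        // On close(), the object a/b/file1 is uploaded and, under the default
        // "delete" marker policy, s3a bulk-deletes the now-redundant parent
        // markers (a/b/, a/).
    }
}
```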

Deleting Directories and Files

  • Deleting directories and files in an HDFS directory tree works essentially the same way for both, and the deletion semantics match general expectations.
  • In s3, because of DMs, deleting a file/directory may also require re-adding a DM when the parent directory has just become empty.

Existing Issues#

Although the handling of directories in s3 is a compromise, it has indeed become the source of many problems.

  1. The different DMs used by s3n and s3a mean that s3a-era Hadoop versions are not backward compatible with data written by s3n (empty directories in particular).
  2. When s3a creates or deletes files, a batch of DMs may need to be deleted or created, which multiplies the number of actual requests. And since s3 counts every read or write of an object as a separate operation, this can add significant overhead (combined with s3's request-rate limits, this is the main reason for s3's poor performance, discussed in the next section).
  3. List operations have to walk every level of the path prefix, so the deeper the directory hierarchy, the longer the request takes.
  4. In versioned s3 buckets, deletions are not physically removed from the index: a delete marker (tombstone) is written instead, even when deleting directory markers, and these tombstones slow down listings of large directories.

Performance Issues with s3a#

The official s3a documentation compares the performance of s3a and HDFS:

(Figure: performance comparison table of HDFS vs. S3A, from the official Hadoop documentation.)

In summary, the following reasons contribute to the performance issues of s3a:

  • IOPS is limited per bucket shard: requests per second are capped per key prefix.
  • Different AWS EC2 instance types have different network IO limits.
  • The more objects and data there are, the longer directory renames and copies take; rename() is especially slow because s3 has no native rename and it is implemented as copy-then-delete.
  • Calling seek() while reading from s3 forces new HTTP requests, which adds overhead when reading columnar formats such as Parquet/ORC (a mitigation sketch follows this list).
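
One commonly cited mitigation for the seek() cost is switching the s3a input policy to random IO; the key below is from the Hadoop s3a documentation. A minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;

public class RandomIOConfig {
    public static Configuration columnarReadConf() {
        Configuration conf = new Configuration();
        // "random" optimizes the input stream for short, seek-heavy reads
        // (Parquet/ORC footers and column chunks) instead of aborting and
        // re-opening the HTTP connection on every backwards seek.
        conf.set("fs.s3a.experimental.input.fadvise", "random");
        return conf;
    }
}
```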

Additionally, note that AWS s3 throttles reads and writes: per AWS's documented limits, once a single key prefix (partition) exceeds roughly 5,500 reads or 3,500 writes per second, s3 rejects requests with 503 "Slow Down" errors.
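
When those 503s are unavoidable, s3a's throttling retry behavior can be tuned; the keys below come from the Hadoop s3a documentation and the values are illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class ThrottleRetryConfig {
    public static Configuration throttleTolerantConf() {
        Configuration conf = new Configuration();
        // How many times to retry requests rejected with 503 "Slow Down"...
        conf.set("fs.s3a.retry.throttle.limit", "20");
        // ...and how long to wait between those attempts.
        conf.set("fs.s3a.retry.throttle.interval", "500ms");
        return conf;
    }
}
```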
As you can see, heavy read/write workloads on s3 still face real challenges. In the big data field, s3 is chosen mostly for its cost-effectiveness. If your scenario prioritizes performance, my advice is: run fast!


Solutions#

Of course, in real scenarios many choices are beyond our control. If we must use s3 in OLAP or other large-scale data processing scenarios, there are still some optimizations worth applying.

  1. Use an s3a committer. Hadoop provides several s3a committers to optimize committing files to s3. The core idea is to use s3's multipart upload mechanism to avoid rename-based commits and speed up uploads; different committers optimize in different ways (a combined sketch follows this list). For details, see the "S3A Committers" documentation in the references.
  2. Tune parameters, such as the thread and connection pool sizes, larger block and readahead sizes, etc.
    For more optimization scenarios and details, refer to the official documentation in the references.
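
Putting both suggestions together, here is a minimal sketch; the keys come from the Hadoop s3a documentation and the numbers are illustrative placeholders, not tuned recommendations (the magic committer also needs support from the query engine's commit protocol):

```java
import org.apache.hadoop.conf.Configuration;

public class S3ATuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Commit output via multipart uploads instead of rename(). The
        // "directory" and "partitioned" staging committers are alternatives.
        conf.set("fs.s3a.committer.name", "magic");
        conf.set("fs.s3a.committer.magic.enabled", "true");

        // More parallelism per FileSystem instance.
        conf.set("fs.s3a.threads.max", "64");
        conf.set("fs.s3a.connection.maximum", "96");

        // Larger uploads mean fewer multipart PUT requests on big writes.
        conf.set("fs.s3a.block.size", "128M");
        conf.set("fs.s3a.multipart.size", "128M");
        return conf;
    }
}
```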

Conclusion#

Overall, s3 itself still has advantages in certain scenarios. As an important part of the AWS ecosystem, people value its security, reliability, cost-effectiveness, and the cumulative advantages brought by its interaction with other AWS cloud products. However, we must also admit that using s3 in inappropriate scenarios can still have significant side effects. In actual production, there may be various factors that prevent us from completely solving the problem, but at least we can understand the product features and optimize performance as much as possible.
So here I have briefly written up some common s3a pitfalls and the measures you can take, most of it compiled from the official documentation. I hope it helps you~

References#

  1. Hadoop-AWS module: Integration with Amazon Web Services
  2. Experimental: Controlling the S3A Directory Marker Behavior
  3. Maximizing Performance when working with the S3A Connector
  4. Committing work to S3 with the “S3A Committers”