Overview

This blog post discusses the AWS S3 Select feature: why you might want to use it, and some sample Java code for working with it.

A bit of history

Amazon S3 was released in March 2006 and it's one of the top 10 most widely used AWS services. When it comes to storing large files, I can't think of any service I'd use over S3.

Most organizations don't operate solely in the cloud. There's still a lot of data that's on-prem. S3 is a great way to move data from on-prem to the cloud. We use S3 to transfer the data we collect each day to AWS, where we run various kinds of analytics on it.

Analytics and more analytics

Even if you have great network connectivity, transferring large amounts of data every day adds significant time to the overall processing time. This means you should try to transfer the data just once, so that all the analytics you want to run can refer to the same uploaded data. This may mean transferring some extra data that is not needed by one analytic service but is needed by another.

Collecting related data together can increase the read time of each application, though. For example, let's say we are transferring book information stored in CSV files using S3, and one analytical process calculates the total sales per day. The data that it operates on looks like:

bookid  cost    quantity
------------------------
10034   $10     12
20104   $4.90    9

Let's also assume we have another analytical process which works out what types of books are popular by analyzing the summary of each book sold.

bookid   quantity   excerpt
---------------------------
10034          12   It is a group of children who see - and feel - what makes Derry so horr ...
20104           9   First published fifteen years ago, shortly after his death, inside this collection ...

To reduce the total amount of data transferred, it makes sense to combine the two datasets, since the bookid and quantity columns would otherwise be sent twice. It also helps to keep related data together and avoids joining across datasets. The combined set will now look like this:

bookid   cost   quantity   excerpt
----------------------------------
10034    $10     12        It is a group of children who see - and feel - what makes Derry so horr ...
20104    $4.90    9        First published fifteen years ago, shortly after his death, inside this collection ...

This introduces a slight problem: each process now also has to read a small amount of data it is not interested in. The sales calculation process has to ignore the excerpt column, and the topic popularity process has to ignore the cost column.
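
To make that overhead concrete, here is a minimal sketch of the sales process working on the combined file without any server-side filtering. The class and method names are hypothetical, and it assumes a comma-separated file with a header row and the column order bookid, cost, quantity, excerpt. It is an illustration, not code from the sample application.

```java
import java.math.BigDecimal;
import java.util.List;

/** Hypothetical sketch: total sales computed from the combined CSV, skipping the excerpt. */
final class CombinedCsvSales {

    /**
     * Sums cost * quantity over all data rows.
     * Assumes the column order bookid,cost,quantity,excerpt and a header on the first line.
     */
    static BigDecimal totalSales(List<String> csvLines) {
        BigDecimal total = BigDecimal.ZERO;
        if (csvLines.size() <= 1) {
            return total; // nothing but (at most) the header row
        }
        // Skip the header row at index 0.
        for (String line : csvLines.subList(1, csvLines.size())) {
            // A limit of 4 keeps any commas inside the excerpt in the last field.
            String[] fields = line.split(",", 4);
            BigDecimal cost = new BigDecimal(fields[1].trim().replace("$", ""));
            BigDecimal quantity = new BigDecimal(fields[2].trim());
            // fields[3] (the excerpt) was transferred and split out, but is never used.
            total = total.add(cost.multiply(quantity));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "bookid,cost,quantity,excerpt",
                "10034,$10,12,It is a group of children who see - and feel - what makes Derry so horr ...",
                "20104,$4.90,9,First published fifteen years ago, shortly after his death, inside this collection ...");
        System.out.println("Total sales: " + totalSales(lines)); // prints "Total sales: 164.10"
    }
}
```

The excerpt is by far the widest column, so the process downloads and splits mostly data it then throws away.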

AWS introduced a nifty feature in Nov 2017 called S3 Select. It allows you to query CSV, JSON and Apache Parquet files using simple SQL statements. Using this feature, your services can pull out only as much data as they need, by filtering rows (WHERE clauses) and by selecting only the required columns.
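
As a rough sketch of what such queries might look like for the two processes above, the helper below builds S3 Select expressions that project only the columns each process needs. The class and method names are hypothetical. The column names come from the sample data and assume the CSV header row is used for column names, and the quantity threshold is an invented example of a filter. S3Object is the table name S3 Select expects in the FROM clause.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that builds S3 Select SQL for the book example. */
final class BookQueries {

    /**
     * Builds "SELECT s.a, s.b FROM S3Object s [WHERE ...]".
     * Column names are inserted as-is, so they must come from trusted code, not user input.
     */
    static String select(List<String> columns, String whereClause) {
        String projection = columns.stream()
                .map(column -> "s." + column)
                .collect(Collectors.joining(", "));
        String sql = "SELECT " + projection + " FROM S3Object s";
        if (whereClause != null && !whereClause.trim().isEmpty()) {
            sql += " WHERE " + whereClause;
        }
        return sql;
    }

    /** Sales process: needs cost and quantity, but not the excerpt. */
    static String salesQuery() {
        return select(List.of("bookid", "cost", "quantity"), null);
    }

    /** Popularity process: needs the excerpt, but not the cost. */
    static String popularityQuery(int minQuantity) {
        // CSV fields arrive as strings, so cast before comparing numerically.
        return select(List.of("bookid", "quantity", "excerpt"),
                "CAST(s.quantity AS INT) >= " + minQuantity);
    }
}
```

Either string can be passed as the sql argument of a method that submits the expression to S3.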

Sample Java code

The AmazonS3 client in the AWS SDK for Java has a method called selectObjectContent() which allows running SQL queries to select data from S3 files. The code below demonstrates running a query on a CSV file stored in an S3 bucket:

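import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.Instant;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CSVInput;
import com.amazonaws.services.s3.model.CSVOutput;
import com.amazonaws.services.s3.model.InputSerialization;
import com.amazonaws.services.s3.model.OutputSerialization;
import com.amazonaws.services.s3.model.SelectObjectContentRequest;
import com.amazonaws.services.s3.model.SelectObjectContentResult;

private void runQuery(Regions regions, String accessKey, String secretKey, String bucket, String objectKey, String sql) throws Exception {
    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);

    AmazonS3 s3client = AmazonS3ClientBuilder
            .standard()
            .withCredentials(new AWSStaticCredentialsProvider(credentials))
            .withRegion(regions)
            .build();

    // Which object to query, and the SQL expression to run against it.
    SelectObjectContentRequest socr = new SelectObjectContentRequest();
    socr.setBucketName(bucket);
    socr.setKey(objectKey);
    socr.setExpressionType("SQL");
    socr.setExpression(sql);

    // Input: uncompressed, comma-separated CSV whose first line holds the column names.
    CSVInput csvInput = new CSVInput();
    csvInput.setFileHeaderInfo("USE");
    csvInput.setFieldDelimiter(",");

    InputSerialization iser = new InputSerialization();
    iser.setCsv(csvInput);
    iser.setCompressionType("NONE");
    socr.setInputSerialization(iser);

    // Output: matching records are returned as comma-separated CSV as well.
    CSVOutput csvOutput = new CSVOutput();
    csvOutput.setFieldDelimiter(",");

    OutputSerialization oser = new OutputSerialization();
    oser.setCsv(csvOutput);
    socr.setOutputSerialization(oser);

    Instant startedAt = Instant.now();
    int matchedRecords = 0;
    // The result wraps an open HTTP response, so close it and the reader when done.
    try (SelectObjectContentResult result = s3client.selectObjectContent(socr);
         InputStream resultInputStream = result.getPayload().getRecordsInputStream();
         BufferedReader streamReader = new BufferedReader(
                 new InputStreamReader(resultInputStream, StandardCharsets.UTF_8))) {
        String line;
        while ((line = streamReader.readLine()) != null) {
            System.out.println(line);
            matchedRecords++;
        }
    }
    Instant endedAt = Instant.now();
    System.out.printf("Got %d records in %d ms%n", matchedRecords, Duration.between(startedAt, endedAt).toMillis());
}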

The full source code to the sample application is available at GitHub.

Performance

How does S3 Select actually perform? I was pleasantly surprised to find that it performs very well indeed. It took S3 Select approximately 8 seconds to find 2 matching records out of a total of 1 million records in a 1.5 GB file. This is quite impressive. The same file takes about a minute to load in Microsoft Excel, after which it takes 1 min 45 seconds to find the two matching records on my laptop, which has 16 GB of RAM and an Intel 8365U processor. GZIPing the 1.5 GB CSV down to a 350 MB compressed file did not reduce the query performance, which is great - you'll spend less time transferring large files over the network.
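
For reference, a gzipped copy like the one used in this test can be produced with the JDK alone before uploading. This is a minimal sketch with hypothetical class and method names, not code from the sample application, and it needs Java 9 or later for transferTo and readAllBytes. Keep in mind that when querying a gzipped object, the compression type on the input serialization has to be GZIP rather than NONE.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: gzip a CSV stream before uploading it to S3. */
final class CsvGzip {

    /** Copies everything from in to out, gzip-compressing on the way. Closes out when finished. */
    static void gzip(InputStream in, OutputStream out) {
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            in.transferTo(gz);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience overload for small, in-memory data. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        gzip(new ByteArrayInputStream(data), out);
        return out.toByteArray();
    }

    /** Reverses gzip(byte[]), which is handy for checking the round trip. */
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a large file, the stream overload can be fed from Files.newInputStream and Files.newOutputStream so that the whole file never has to sit in memory.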

Limits

S3 Select does come with some limits (details are available here), including:

The maximum length of a SQL expression is 256 KB
The maximum length of a record in the input or result is 1 MB
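
A service that builds queries dynamically could check these limits before calling S3. The sketch below is a hypothetical helper, not part of the AWS SDK. It assumes that 256 KB means 256 × 1024 bytes and 1 MB means 1024 × 1024 bytes, both measured on the UTF-8 encoding, which the limits above do not spell out.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical pre-flight checks for the two S3 Select limits listed above. */
final class S3SelectLimits {

    // Assumption: KB and MB are binary units (1024-based).
    static final int MAX_EXPRESSION_BYTES = 256 * 1024;
    static final int MAX_RECORD_BYTES = 1024 * 1024;

    /** True if the SQL expression is within the 256 KB limit. */
    static boolean expressionFits(String sql) {
        return utf8Length(sql) <= MAX_EXPRESSION_BYTES;
    }

    /** True if a single input or result record is within the 1 MB limit. */
    static boolean recordFits(String record) {
        return utf8Length(record) <= MAX_RECORD_BYTES;
    }

    private static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Failing fast on the client gives a clearer error than waiting for the service to reject the request.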

Conclusion

S3 Select is a very nifty feature which makes working with CSV, JSON and Apache Parquet files very easy. You now have ways to filter your data vertically and horizontally right at the source (the S3 bucket) itself. Though the types of operations are limited, it's still a very useful feature to have.

References

Sample code in GitHub
AWS documentation
Querying data without servers or databases using Amazon S3 Select

