Amid a digital transformation shaped by the exponential increase in the creation and consumption of data, companies have realized how important it is to store data efficiently to make the most of it. With a recession looming and cloud computing costs making up almost half the technology budget for most companies, efficient use of cloud services and resources, especially cloud storage, has become a priority.
The Shrinking Budget for Cloud Storage Solutions
Optimizing Your Company’s Object Storage
Even if you think your business might be unaffected by the looming economic changes, optimizing your data storage has benefits that go well beyond storage cost savings. Data is an exceptional asset, and when used properly, it can provide valuable insights, improve your products and solutions, help increase profit margins, and uncover hidden efficiencies your business can take advantage of.
But to make sense of the millions of data points your business compiles daily, data must be properly managed to allow for faster and more efficient processing. This requires data to be clean, organized, partitioned and stored in a way that is usable and easily manageable for the business in the most time- and resource-efficient manner. This is no longer a nice-to-have option; it has become necessary to keep up with industry developments and the competition.
At Factored, we understand what it means to optimize costs while enabling growth for your business. We’ve worked with the top cloud storage providers in a variety of Big Data, Machine Learning, and Data Analytics projects. We know how important it is for companies to maximize the benefits of storing their data in the cloud while optimizing those costs.
While cloud solutions are almost always a far better option than on-premises ones, the resources and services the cloud provides must be used in a deliberate, optimized way. For example, poor decisions about how data is stored can significantly increase storage costs as companies grow.
Let’s see why. Object storage services, such as Amazon S3 and Google Cloud Storage, are billed based on three main points: storage, requests & data retrieval, and data transfer. From our experience, we’ve found three cloud storage considerations that must be addressed to implement efficient and effective data storage cost and optimization strategies: Serialization and Compression, Compaction, and Data Lifecycle.
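To make the three billing dimensions concrete, here is a minimal Python sketch of a monthly bill estimate. The unit prices are hypothetical placeholders, not any provider's actual rates, and the names Pricing and monthly_cost are invented for this illustration; substitute the figures from your provider's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical unit prices; replace with your provider's real rates."""
    storage_per_gb_month: float = 0.023  # assumed, USD per GB stored per month
    per_1000_requests: float = 0.0004    # assumed, USD per 1,000 read requests
    transfer_per_gb: float = 0.09        # assumed, USD per GB transferred out


def monthly_cost(stored_gb, requests, transferred_gb, pricing=Pricing()):
    """Sum the three billing dimensions: storage, requests, and transfer."""
    storage = stored_gb * pricing.storage_per_gb_month
    request_fees = (requests / 1000) * pricing.per_1000_requests
    transfer = transferred_gb * pricing.transfer_per_gb
    return storage + request_fees + transfer


if __name__ == "__main__":
    # The same 1 TB dataset, read once a month as 10 million tiny objects
    # versus 10 thousand larger ones. Only the request count differs.
    many_small = monthly_cost(1000, 10_000_000, 1000)
    few_large = monthly_cost(1000, 10_000, 1000)
    print(f"many small objects: ${many_small:.2f}")
    print(f"few large objects:  ${few_large:.2f}")
```

Keeping the three terms separate makes it easy to see which of the strategies discussed in this article affects which part of the bill.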
Serialization and Compression: Same but different
Depending on which format you’re using to store your data, you could reduce your bill. The format in which data is stored makes a huge difference in the amount of storage you consume, and the amount of time it takes when retrieving data from these storage services.
Serialization is the transformation of objects into a stream of bytes that saves the state of the object in an easily transmissible form. Depending on the purpose and type of data, there are different serialization strategies. For structured data, for instance, whether the data is stored in a column-oriented or a row-oriented format makes a significant difference in the storage costs you end up paying. Row-oriented formats, such as CSV, store data row by row, with the fields of each record stored contiguously. When retrieving data, even if you only want some of the columns, the service reading the data has to go through every row to extract the desired columns’ values. Column-oriented formats, such as Parquet or ORC, instead group the data by column. This lets you retrieve just the columns you need for your analysis, which greatly reduces how much data is actually retrieved and transferred over the network.
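The difference between the two layouts can be sketched with the Python standard library alone. This is a toy model, not how Parquet or ORC are actually implemented (real columnar formats add row groups, metadata, and encodings); the sample records and function names are hypothetical. It only shows that reading one column from a row layout touches every byte, while a column layout touches just that column's block.

```python
import csv
import io

# Hypothetical sample records, used only for illustration.
records = [
    {"device_id": f"dev-{i % 5}", "temperature": 20.0 + i % 7, "status": "ok"}
    for i in range(1000)
]


def to_row_layout(rows):
    """Row-oriented: one CSV line per record, fields stored side by side."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def to_column_layout(rows):
    """Column-oriented: one contiguous byte block per column."""
    return {
        name: "\n".join(str(row[name]) for row in rows).encode("utf-8")
        for name in rows[0]
    }


def read_column_from_rows(blob, name):
    """Every row is parsed even though only one field is wanted."""
    reader = csv.DictReader(io.StringIO(blob.decode("utf-8")))
    return [row[name] for row in reader], len(blob)


def read_column_from_columns(blocks, name):
    """Only the requested column's block is touched."""
    block = blocks[name]
    return block.decode("utf-8").split("\n"), len(block)


row_blob = to_row_layout(records)
column_blocks = to_column_layout(records)

values_a, bytes_a = read_column_from_rows(row_blob, "temperature")
values_b, bytes_b = read_column_from_columns(column_blocks, "temperature")

print(f"row layout:    scanned {bytes_a} bytes")
print(f"column layout: scanned {bytes_b} bytes")
```

Both reads return the same values; what changes is the number of bytes that had to be fetched to get them.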
Compression is the reduction in the number of bits required to represent data. Data compression can reduce network bandwidth requirements, speed up file transfers, and save space in storage systems. Some compression algorithms are lossy and reduce data quality (e.g., for image, video, and audio), and in some cases the end use tolerates this reduction without major repercussions (e.g., video streaming, or images and voice recordings sent through instant messaging). Other algorithms are lossless and, depending on the data type, let you reduce the data size without losing quality. In the case of structured data, columnar formats can also compress data by as much as 10-fold compared to a CSV counterpart. When Parquet files are written, several encoding techniques are applied behind the scenes. For example, dictionary encoding is appropriate for data with a small number of unique values: each value is replaced by a small integer, and a reference dictionary stores each integer’s meaning. Parquet also uses bit packing and run-length encoding (RLE).
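The three techniques named above can be illustrated in a few lines of plain Python. This is a conceptual sketch of the ideas, not Parquet's actual on-disk encoding; the sample column and all function names are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = []
    codes_by_value = {}
    codes = []
    for value in values:
        if value not in codes_by_value:
            codes_by_value[value] = len(dictionary)
            dictionary.append(value)
        codes.append(codes_by_value[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Look each code up in the dictionary to recover the original values."""
    return [dictionary[code] for code in codes]


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def run_length_decode(runs):
    """Expand (code, run_length) pairs back into the full sequence."""
    return [code for code, length in runs for _ in range(length)]


def bits_per_code(dictionary):
    """Bit packing idea: store each code in the fewest bits that fit."""
    return max(1, (len(dictionary) - 1).bit_length())


# Hypothetical column with few distinct values.
column = ["ok", "ok", "ok", "warning", "warning", "ok", "error", "error", "error"]

dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(dictionary)                 # ['ok', 'warning', 'error']
print(codes)                      # [0, 0, 0, 1, 1, 0, 2, 2, 2]
print(runs)                       # [(0, 3), (1, 2), (0, 1), (2, 3)]
print(bits_per_code(dictionary))  # 2
```

Both encodings are lossless: decoding the runs and then the codes returns the original column exactly.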
Compaction: Size matters
Avoid storing data as a large number of small (kilobyte-scale) files. When many files are accessed sequentially, you incur extra costs, both in time and money. Here’s why:
Small files impede performance regardless of the means you use to access the data. This is because each file has to be opened, read, and closed, which adds overhead. These three steps may only take milliseconds per file, but multiply them by hundreds of thousands or millions of files, and those milliseconds add up. Remember that you are billed per request as well as for the amount of data accessed, so if you read a thousand files, you pay for a thousand requests.
This is a common scenario when working with streaming data, such as that generated by IoT devices. Streaming data usually produces many small files, which can cause headaches when trying to perform analytics on it. The strategy to address this issue is simple: merge many small files into larger ones. This strategy is called compaction. It cuts down the time you spend reading the data because you have fewer files to open and close, and since there are fewer files, fewer API calls are made, which reduces the cost of your storage services.
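A minimal sketch of compaction on a local directory, assuming the small files are newline-delimited JSON (so they can be concatenated byte for byte) and that each one ends with a newline. The function name compact, the file naming scheme, and the size threshold are all hypothetical; a columnar format such as Parquet would need a format-aware rewrite rather than plain concatenation.

```python
import tempfile
from pathlib import Path
from typing import List


def compact(source_dir: Path, target_dir: Path, max_bytes: int) -> List[Path]:
    """Concatenate small files into outputs of at most about max_bytes each."""
    target_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    buffer = bytearray()

    def flush():
        # Write the accumulated bytes as one larger file, then reset.
        if buffer:
            out = target_dir / f"part-{len(outputs):05d}.jsonl"
            out.write_bytes(bytes(buffer))
            outputs.append(out)
            buffer.clear()

    for small_file in sorted(source_dir.glob("*.jsonl")):
        data = small_file.read_bytes()
        # Start a new output if adding this file would exceed the limit.
        if buffer and len(buffer) + len(data) > max_bytes:
            flush()
        buffer.extend(data)
    flush()
    return outputs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw"
        raw.mkdir()
        for i in range(1000):
            (raw / f"event-{i:04d}.jsonl").write_text(f'{{"event": {i}}}\n')
        merged = compact(raw, Path(tmp) / "compacted", max_bytes=4096)
        print(f"1000 small files -> {len(merged)} compacted files")
```

Reading the compacted output takes one request per merged file instead of one per original event file.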
Data Lifecycle: In the circle of life
How frequently you access your data says a lot about how valuable that data is to your operation. Cloud storage providers offer different storage tiers to match your data’s access patterns. These tiers are based on how often the data is accessed and how long it takes to retrieve it. Costs vary by tier: the higher the access frequency and the faster the retrieval time, the more expensive the tier. It does not make sense to store low-value data in the same tier as your most valuable data.
It’s important to mention that less expensive storage tiers have limitations. For example, the retrieval time for data in those tiers is longer. Nevertheless, using them lets companies keep their data cost-efficiently. A use case for archival storage tiers is a business that is subject to regulations requiring it to log and store data for several years for compliance reasons.
Below is an illustration of different storage services tiers common to most cloud providers:
In many cases, companies store their data in the default storage tier (optimally used for most frequently-accessed data), forgetting that there are better and more cost-efficient alternatives to store it. Choosing an option depends on how often the data is needed. For example, many companies have data that hasn’t been accessed in months, and still, they pay the full price for their storage. A cost-optimization strategy would be to move that data to a service that provides a better tradeoff between cost and retrieval time.
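As a rough sketch of this kind of decision, the function below picks a tier from the number of days since an object was last accessed. The tier names and day thresholds are hypothetical and chosen only for illustration; real tiers, prices, and minimum storage durations differ by provider.

```python
from datetime import date

# Hypothetical rules: (idle days below this limit, tier name).
TIER_RULES = [
    (30, "standard"),
    (90, "infrequent-access"),
    (365, "archive"),
]
COLDEST_TIER = "deep-archive"  # hypothetical name for the cheapest tier


def choose_tier(last_accessed: date, today: date) -> str:
    """Return the first tier whose idle-day limit has not been reached."""
    idle_days = (today - last_accessed).days
    for max_idle_days, tier in TIER_RULES:
        if idle_days < max_idle_days:
            return tier
    return COLDEST_TIER


if __name__ == "__main__":
    today = date(2024, 6, 1)
    for last in (date(2024, 5, 25), date(2024, 4, 1), date(2023, 1, 1)):
        print(last, "->", choose_tier(last, today))
```

In practice the thresholds would come from your own access logs and from the retrieval times your operation can tolerate.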
Data is not meant to be stored forever. An important part of the data lifecycle is defining when to delete data. There is data that, once aggregated and analyzed, has no use. Most cloud providers let you set policies to automatically remove old data, reducing your storage costs over time.
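For example, on Amazon S3 (one of the services mentioned earlier) an expiration policy is expressed as a lifecycle rule. The sketch below only builds the rule as a Python dictionary and prints it; the rule ID, the prefix, and the retention period are hypothetical, and the helper name expiration_rule is invented for this illustration. Other providers use a different shape for the same idea.

```python
import json


def expiration_rule(rule_id: str, prefix: str, days: int) -> dict:
    """Build one S3-style lifecycle rule that deletes objects after `days`."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},   # apply only to keys under this prefix
        "Status": "Enabled",
        "Expiration": {"Days": days},   # delete objects this many days after creation
    }


# Hypothetical values: raw events are dropped a year after they are written.
lifecycle_configuration = {
    "Rules": [expiration_rule("expire-raw-events", "raw/events/", 365)]
}

# With boto3, a dictionary like this is what would be passed as the
# LifecycleConfiguration argument of put_bucket_lifecycle_configuration.
# The call is not made here so the sketch runs without credentials.
print(json.dumps(lifecycle_configuration, indent=2))
```

Scoping the rule to a prefix keeps aggregated results and compliance data out of its reach, so only data that has truly served its purpose is removed.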
Don’t leave money on the table by not optimizing how you manage your data on cloud object storage. Remember that depending on the format and how you compress data, you could be saving costs not only for data at rest but also for the amount of data transmitted. Storage services also charge you for requests and data retrieval, so many small files waste a huge amount of resources. You don’t want to store your data forever; let it go. Define your data lifecycle based on the data use case and data access patterns (do not forget about regulatory compliance!). Applying these best practices will significantly reduce your bill on cloud object storage services while still running your operation as usual.
Do you want to assess how your company is doing and get tips on how to make every penny count for your cloud storage? Answer the following questions and find out!