Getting the Most Of Your Time Series Data
Unlocking the Potential Of Time Series
Posted by Alejandro Rusi
on May 19, 2021 · 11 mins read
In our Delivering Your Marketplace's Orders in Time with Machine Learning entry we briefly mentioned time series. We talked about how a time series is just a series of data points ordered in time. And once we had that data it was possible to predict the duration of the preparation of an order based on previous observations.
But the world of time series is much bigger than that. How do I know if I have time series data in my hands? What can I learn from it? Are they only useful for forecasting? Do I need to be an expert to obtain an insight from time series data? How can we properly store its representation?
In this entry we aim to answer all of those questions and show the potential of working with time series data.
How to spot a time series
How do we know if we already have a time series in our hands? And if you don't, where do we acquire it? There are two important things that can help us recognize a time series:
- It has events happening in time (days, hours or even milliseconds!).
- The order in time in which these events happened is relevant. If we wanted to forecast tomorrow's weather, then we want the recent weather to be more relevant to this forecast than the weather we had a few months ago.
- A new event doesn't replace or update the old one.
In short, if you can keep track of changes over time, then you've got a time series in your hands. For example: say we're developing a Food delivery app. For each user we could simply store their current order in a database, and delete or update its entry once it gets delivered. But users will make more than one order with our app during their lifetimes (hopefully!). If we just stored their ongoing orders and cleared them once they're over, we're losing useful information.
What are their favourite restaurants? When do they usually order food? Do we have any other similar users that order from a restaurant that this person does not? Those are some of the questions we can answer if we store time series data.
Now that we know that we do want to use time series data, storing the information we mentioned earlier would look something like this:
If you wanted to plot the accumulated amount of orders for each unit of time then it would probably look something like this:
Looking at just one day of data is okay, but time series are great for getting a peek at the bigger picture:
Seems like our business had a dip around the end of 2021 and at the start of 2022, but the amount of orders only seems to be growing!
And why stop at Food Delivery? It's not hard to come up with many more examples for time series and uses we could give them:
- Take a bank for example. One of its users always makes small transactions to the same business and accounts during the day. But suddenly a transaction for a large sum of money appears in the middle of the night! If the bank stored every transaction the user made over time they could use that information to detect anomalous behaviour, and then cancel out that transaction or ask for additional security validations.
- Any kind of Marketplace would probably like to know how its performing. How many users are getting to the checkout page but not buying anything? At what time do we have the most purchases? Is it different from the time we have the most visits to our Marketplace? Time series data can easily answer those questions for us, we just need to know how to ask (or how to query!).
These are some of the things we can obtain from our time series. Now let's get into how to do that.
Now that we know what time series data is, let's assume we have access to one of them.
Maybe you've got a
.csv file representing your time series and are planning to boot-up a
Or you might be more of a spreadsheet person.
But it's more likely that you'll be storing and interacting with your time series through a Database.
But why would we need to interact with our data via the Database system? Wouldn't it be better to just extract the data and use a spreadsheet or an interactive notebook? Well, yes and no. If we were to extract the data each time we wanted an insight, it'd be fine the first time but it'd get boring fast. After all, extracting and downloading data takes time. And if we changed our minds and wanted to work with a different data set, then we'd have to run a long extraction and download again! Plus we'd be limiting ourselves to only extracting data that fits in our computer!
Asking a question to a Database is called a query. And thankfully there is a common language for querying most databases: SQL. SQL offers a common way to query different databases. This means that no matter the Database, if you already know SQL then you can transfer your knowledge instantly.
Remember our restaurant example? We had a Database storing each order made by a user throughout time. Let's use SQL to find out the five most active restaurants last week:
SELECT restaurant, COUNT(*) FROM table WHERE datetime > now() - INTERVAL 7 days GROUP BY restaurant ORDER BY COUNT(*) DESC LIMIT 5
A query like that is extremely simple to write for anyone with a little bit of SQL experience.
What some people don't know is that some databases, that are not specifically used for time series, have some pretty nifty time series functions.
For example PostgresSQL has the
lag function, which lets us take a field from a previous row.
And since we are working with time series, this row should be the previous event in time.
lag to see the difference in orders for each month:
SELECT month, order_amount, lag(order_amount, 1) AS order_diff OVER (ORDER BY month) FROM ( SELECT month, COUNT(*) as order_amount FROM orders GROUP BY month ) LIMIT 5
And believe it or not, you can also use a function to find the correlation between the value of two columns:
corr! Let's see if there is a correlation between the amount of orders and the amount of users registered.
SELECT month, corr(order_amount, user_registration_amount) FROM ( SELECT month, COUNT(*) as order_amount FROM orders GROUP BY month ) orders_by_month INNER JOIN ( SELECT month, COUNT(*) as user_registration_amount FROM user_activity WHERE activity = "user_registration" GROUP BY month ) user_activity_by_month ON orders_by_month.month = user_activity_by_month.month ORDER BY 2 DESC LIMIT 3
Seems like there is a high correlation between those two values after all!
Forecasts and predictions
We talked about many uses of Forecasting in detail in our Improving the Three-Sided Marketplace with Machine Learning. What we didn't mention is that time series data is a perfect opportunity for using Machine Learning algorithms: they are fast, effective and easy to tune to your needs.
Feature engineering is key to the performance of your Machine Learning model. The good news is that there are a lot of features common to all time series data. Consider these features next time you want to add new features to your time series forecasting model:
Date and time features are key. Adding the hour or day name of an event is a simple yet effective feature. Using a library like holidays is also a great tool for knowing if an event in your time series happened during a holiday.
Categorize ranges of time that mean something to your app. For example businesses with a happy hour could categorize events happening during those hours as a feature!
Remember lag? Getting a lagged value (an amount from a previous unit of time) was very useful for insights. But it's also a very common feature in Machine Learning models. It helps your model know that events that happened recently will be similar to the one you're trying to forecast. Getting the lagged values for the last hour, last day at the same time and last week at the same time is a great idea!
And what do we gain from Forecasting?
- An easy way to know what to expect in the future.
We can plan ahead around the results of our forecasts. Now we know if we'll need to get more people for a shift, or if we'll need to stock-up a product. Do this either manually or with a Supply and Demand Optimization algorithm!
The day is not over yet, but we've got a good estimate on what will happen in the remaining hours of the day thanks to our forecast.
- Detect Anomalous behaviour.
We can detect that extremely high or extremely low values are outliers by comparing them against our forecast. They are great for detecting suspicious behaviour and creating alarms around your key business metrics. If we had a time series for the money being processed by our business, we could do some anomaly detection thanks to the lower and upper boundaries of our forecasts.
Seems like an usual amount of money is being processed at 12PM! That might be OK though: maybe some of our business is just booming. But the amount of dollars processed goes down to zero at 17PM! Did our service go down? Better make sure by setting up an alarm for when the value becomes extremely low.
- Understand trends and seasonality in your data.
Time series are a powerful tool and allow us to obtain information about their trends and patterns. It's awesome to see if your business had a positive trend in its third quarter, or check if there is a certain seasonality to the amount of sales during certain months.
But why limit ourselves to events that happened in the past? Trends can also be obtained from forecasts! This way you'll anticipate future trends that have not happened yet and check if the future falls under a certain type of seasonality.
Storage and representation
There are multiple ways to store time series, and the correct choice depends on your own personal use case. Are you just starting out and don't have that many events yet? Then it's fine to start out with a simple relational Database, such as PostgresSQL or MariaDB. You'll probably have some kind of Database for storing key business information (registered users, money balance, things like that).
It's OK to store Timeseries data in there when you're just starting out. Plus it's very handy to have both your business related data and your Timeseries data in the same place, as it allows you to join both of them for more powerful insights. And if you want to squeeze some extra performance out of these Databases, then there are tons of ways of tune them specifically for dealing with Timeseries data.
Are you tracking tons of events at the same time from completely different sources? Then maybe a Data Lake is the answer.
A Data Lake is an awesome way of joining multiple data sources together! Be it data from relational databases, user events being tracked by an API or even unstructured data such as text or binary files. This means that transforming and joining all of that data together becomes easier, and allows us to enrich our time series data or use it to enrich some other kind of data.
Even though most Data Lakes are not specifically optimized for time series, their performance is still very good. The best thing is that since a Data Lake is inherently distributed and scalable, it's easy to adapt it to support any amount of time series events, no matter their amount or their granularity.
Even though you can host your own Data Lake, it's often easier to host it in the Cloud using vendors such as Amazon, Microsoft or Google.
But what if your use-case requires you to have the best of both worlds? You need the ease of use of a Database with blazing fast queries, but with the scalability of a Data Lake.
There has been a lot of growth and buzz around time series databases. Their objective is to be the best solution for working this kind of data.
These time series databases know and expect that time series data accumulates extremely quickly. They're optimized for automatically scale to the needs of the user. This means that both data insertion and queries are optimized to be as fast as possible.
This doesn't sound too different from a normal NoSQL database which are usually easily scalable. But another of their priority is usability: most NoSQL databases introduce their own dialect for querying data instead of using SQL.
But time series databases like Timescale make a big focus on usability by maintaining a SQL dialect for queries. Now we can immediately transfer our knowledge around queries from one Database to another!
We hope this post made you a little less afraid and more interested in the world of time series data! You can start getting useful insights from a time series right away. Remember that it's OK to start small by tracking some events you find interesting, storing them in a relational database and using SQL queries and light-weight forecasting methods for gaining useful information.
Once this simple solution doesn't fit your needs anymore, it's time to scale up. A good distributed and scalable system for storing your data is a good investment. If you're looking into forecasting algorithms that are more compute-heavy then beefing up the instance that runs them is a great idea!
Need help deciding which of these solutions fits you the best? Are you looking to develop a reliable, safe and cost-efficient data infrastructure for your time series data? Looking to get some real business results from your data? At MUTT DATA we are always working with clients who have completely different business needs and different types of data, and our solutions are always made to fit!
If this seems interesting to you Contact Us!
Useful references and further reading
There is a lot more to time series than what we've talked about here! There are lots of topics we haven't mentioned, and the ones we did mention can be explored in a greater depth.
These are some of the resources we recommend if you want to dig deeper on time series!
Kulkarni, Ajay. Booz, Ryan. “What the heck is time-series data (and why do I need a time-series database)?”. Timescale Blog, URL
Athanasopoulos, George. “Hyndman, Rob J.Forecasting: principles and practice”. Edited by OTexts, 2018.
Hamilton, James Douglas. “Time Series Analysis”. Edited by Princeton University Press, 2020.
Freedman, Mike. “Time-series data: Why (and how) to use a relational database instead of NoSQL”. Timescale Blog, URL.