Data Ingestion Strategies for Media Streaming Businesses Grappling with Analytics
To introduce the subject at hand, we would like to start this article with an obvious statement: over the past eight years, big data has clinched a place for itself in our world. We have seen some companies reluctant to take the plunge, but bit by bit even they have moved their data into the cloud. More recently, companies like ours have emerged: companies conceived from the ground up to collect data, parse and normalize it, slice and dice it, assess and analyze it, all to put it to work for us.
The voluminous supply of (previously useless) data we had locked away, gathering dust in our data centers, now promises to become the goose that lays the golden eggs. Everyone is hoping to cash in.
How? Essentially, by turning our data into knowledge, taking our data and polishing it until it reveals important insights about our customers. The greater our insight about our customers, the more personalized a service (or product) we can offer them. Personalization in turn leads to greater trust and influence over their decisions.
But, before we get to the well-polished golden egg, we have to start at the beginning: moving data around. Surprisingly (or maybe not), data movement is not most companies’ biggest focus, but it fast becomes a growing headache.
Relatively young media streaming businesses have a huge advantage here: they are in their early days, and the data they have to move is still incipient and just starting to grow. They are not like companies that have been around since the 1960s and now have entire silos of data, in one form or another, that they need to move.
The large-scale success stories are well known: Netflix, HBO, Amazon Prime, and most recently Disney+. Traditional media companies are now also branching out and building their own streaming businesses (NBC Universal, M7, ZDF), and there are important niche market players (Pureflix, Teatrix). These companies all share at least one thing in common: making their data profitable involves gaining insight into their customers’ consumption habits.
That is where Jump fits in. We were one of the first to notice that the new players in the entertainment industry were preparing to battle it out over content. They concentrated on offering customers as much content as possible, without paying attention to what would really make customers loyal and willing to pay a monthly subscription.
When Jump’s customers move their data into the Jump Data Lake, they benefit from structured data that can be analyzed and easily used for key business decisions. For example, they can determine whether the content they have on offer engages their viewers, or whether they need to shake up their catalog and provide content that viewers really like.
As for the strategy Jump has implemented: from the outset, Jump has run on AWS, so its data movement strategy can use AWS services in addition to internally developed software.
Namely:
- API Gateway, when our customers want to push data to us on a daily basis.
- SFTP servers, when they want to provide us with historical data.
- Our own SDK, custom software hooked into our customers’ playback systems for any kind of device their service is offered on.
- Custom Scala developments for retrieving data served through APIs (sketched below).
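To make that last point concrete, a simplified version of one of those Scala retrieval jobs could look like the sketch below: a daily job that pulls the previous day’s playback events from a customer API and lands the raw payload in the Data Lake. The endpoint, token variable, bucket name and key layout are hypothetical placeholders, not an actual client configuration.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.LocalDate

import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.PutObjectRequest

object DailyApiIngest {
  def main(args: Array[String]): Unit = {
    // The daily batch collects yesterday's consumption data.
    val day = LocalDate.now().minusDays(1)

    // 1. Pull the day's playback events from the customer's API
    //    (URL and token variable are hypothetical).
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"https://api.example-customer.com/v1/playbacks?date=$day"))
      .header("Authorization", "Bearer " + sys.env("CUSTOMER_API_TOKEN"))
      .GET()
      .build()
    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    require(response.statusCode() == 200, s"API returned ${response.statusCode()}")

    // 2. Land the raw payload in the Data Lake bucket, partitioned by date
    //    (bucket name and key layout are illustrative only).
    val s3 = S3Client.create()
    s3.putObject(
      PutObjectRequest.builder()
        .bucket("jump-data-lake-raw")
        .key(s"customer-x/playbacks/dt=$day/payload.json")
        .build(),
      RequestBody.fromString(response.body())
    )
  }
}
```

Each client needs its own variant of a job like this, which is exactly the per-client development overhead described next.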
As the media streaming business grows, the ways our customers store data (and where they do so) also expand. Previously, if customers wanted us to collect their data, we only had Scala, and that required development time for each client depending on where they stored their data. We realized we needed something faster and easier in order to get daily data.
That is why Apache NiFi came into our lives. We recognized that this mature Apache project catered to 80% of our use cases and was a better option for moving data into our Data Lake. There are all kinds of out-of-the-box processors to connect with databases, SFTP servers, Kafka, buckets, etc. We saw that the software dominated the market, and we adopted both the open-source version offered by Apache and Cloudera’s commercial distribution. This says a lot about our confidence in the product.
We recognized the benefit it would have on development times. Its graphical interface lets you build almost any flow you need, and for what we wanted to do it far outpaced our per-client Scala processing. Processing of the daily data volumes also stayed within our standards.
The only problem we saw with this approach is that we had to leave a cluster running, whereas we wanted a way to schedule the deployment of our flows and spin up a cluster only when needed. For most of our clients we collect their data in a single daily batch process. So we developed a custom framework using the NiFi REST API, Bash scripting, Docker and Kubernetes, which allows us to automate the installation of a single NiFi node or, if the data volume is bigger, to create a NiFi cluster of up to 10 nodes for the data movement.
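For illustration, the piece of that framework which drives NiFi over its REST API could look something like the sketch below, rendered in Scala rather than the Bash scripting we actually use. It assumes NiFi’s standard flow scheduling endpoint; the host, process-group ID and wait logic are placeholders rather than production values.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object NiFiFlowRunner {
  // Hypothetical NiFi endpoint; in practice the node is provisioned by the framework.
  private val nifiApi = "http://nifi.internal:8080/nifi-api"
  private val client  = HttpClient.newHttpClient()

  /** Set the scheduling state (RUNNING or STOPPED) of every component in a process group. */
  def setState(processGroupId: String, state: String): Int = {
    val body = s"""{"id":"$processGroupId","state":"$state"}"""
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$nifiApi/flow/process-groups/$processGroupId"))
      .header("Content-Type", "application/json")
      .PUT(HttpRequest.BodyPublishers.ofString(body))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode()
  }

  def main(args: Array[String]): Unit = {
    // Process-group ID of the deployed flow (placeholder; supplied by the framework).
    val groupId = args.headOption.getOrElse("ingest-process-group-id")

    require(setState(groupId, "RUNNING") == 200, "failed to start the NiFi flow")
    // Placeholder wait: a real run would poll the group's status until the daily
    // batch has drained before tearing the node or cluster down again.
    Thread.sleep(60L * 60 * 1000)
    setState(groupId, "STOPPED")
  }
}
```

In our real framework the equivalent logic lives in Bash and runs as part of the Docker/Kubernetes-managed batch, once the NiFi node or cluster has been provisioned.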
The diagram below illustrates how we are doing it initially. Now that we have this in production, we plan to move to Airflow to orchestrate the remaining “manual” steps and more fully automate what is already partly automated.