As we have posted in the past, Amazon Web Services (AWS) has been a very attractive option for companies looking to outsource some, or all, of their infrastructure. Microsoft, while a little late to the market, has put together a suite of tools comparable to Amazon’s AWS offerings, including a very robust managed Hadoop service. This post outlines how cloud offerings like Azure’s HDInsight service have matured to the point that they should be strongly considered as an ECM repository when evaluating where to store your enterprise content.
Hadoop has typically been thought of as a “Data Lake” for all of your unstructured content. In its infancy, Hadoop was designed and developed for exactly that use case: dumping a bunch of files into a large distributed filestore that could leverage large numbers of cheap computers. In these “big data” use cases, IT developers would write MapReduce code that split the work of parsing large volumes of unstructured data into small chunks that individual commodity computers could handle. This is a great paradigm for running large batches of data through a “schema-on-read” mechanism to explore data. Unfortunately, this approach doesn’t fit the usage pattern of a traditional ECM system, whose users require quick retrieval of stored data. More recently, Hadoop’s ecosystem has added the HBase sub-project, which provides real-time access to data stored in Hadoop. This allows document metadata to be stored alongside the content of the document.
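To make the difference between the two access patterns concrete, here is a minimal sketch in plain Python. It is only a toy model: the dictionary stands in for an HBase-style table with a row key per document and "family:qualifier" columns, and the names `table`, `map_words`, `reduce_counts`, `batch_word_count`, and `get_document` are ours for illustration, not part of Hadoop's or HBase's actual APIs.

```python
from collections import Counter

# Toy stand-in for an HBase-style table (an assumption for illustration,
# not the real HBase client API): row key -> {"family:qualifier": value}.
# Metadata ("meta:*") sits in the same row as the content ("content:*").
table = {
    "doc-001": {
        "meta:title": "Invoice 1001",
        "meta:type": "invoice",
        "content:body": "total due 500",
    },
    "doc-002": {
        "meta:title": "Service agreement",
        "meta:type": "contract",
        "content:body": "contract total 900",
    },
}


def map_words(row):
    """Map step: emit a (word, 1) pair for every word in one document body."""
    return [(word, 1) for word in row["content:body"].split()]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


def batch_word_count(rows):
    """Batch, MapReduce-style job: has to visit every row to answer."""
    pairs = []
    for row in rows.values():
        pairs.extend(map_words(row))
    return reduce_counts(pairs)


def get_document(rows, row_key):
    """ECM-style retrieval: fetch one document and its metadata by key."""
    return rows[row_key]
```

Calling `batch_word_count(table)` touches every document, which is fine for exploration but not for a user waiting on a single file, while `get_document(table, "doc-002")` returns that document's metadata and content in one keyed lookup. That keyed lookup is the access pattern an ECM system depends on.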
Hadoop’s model of relying on lots of lower powered computing resources translates perfectly to the way Amazon Web Services and Microsoft’s Azure platform rent out computing resources from their managed data centers. The ability to spin up a cluster of 10 servers in a matter of seconds was a tipping point for users wanting to leverage this technology. While the advent of Ambari and other tools has made it much easier to install and configure Hadoop/HBase in the cloud, doing so still required extensive technical knowledge of how Hadoop works, and often required users to know how to run Linux commands to perform the install.
Microsoft has addressed this pain point with its HDInsight platform. With a few clicks, a user can create a multi-terabyte repository that is fully managed and maintained by Microsoft. This used to be a very complex process reserved for highly skilled technical resources. It was also surprising to us at TSG that Microsoft has fully embraced Linux in its Azure platform. Microsoft has done a great job wrapping up its Hadoop as a Service offering in a way that is approachable for end users.
By leveraging the scale of the cloud, it is now very easy to take advantage of the same technology that powers Facebook and Netflix. As we see more of our clients begin to warm to the idea of leveraging Hadoop for document management, Microsoft’s entry into the space is a welcome addition. Let us know your thoughts in the comments below.