Dave Giordano, CEO of TSG, recently sat down with Alan Pelz-Sharpe to discuss our recent 11 Billion File Benchmark on AWS and DynamoDB. Additionally, hear Dave’s thoughts on the future of ECM and why companies are considering alternatives to the traditional ECM Suite.
Listen to The Deep Analysis Podcast here:
http://deepanalysis.buzzsprout.com/121366/1324894-the-11-billion-file-benchmark
Below is a transcript of the conversation with Alan. Some modifications have been made for clarity.
Alan: Well, today we’ve got Dave Giordano, the CEO of TSG. We’ve worked with him in the past and we’re working on something at the moment, which we’re going to talk about on this podcast. Dave, introduce yourself. Tell us a little bit about TSG and what you’re up to.
Dave: I’m Dave Giordano. I’m the president and founder of Technology Services Group. We were founded in 1996. We grew up in the Documentum space, although we’ve added a lot with Alfresco, and really just have a firm focus on the ECM space. My background before (TSG) was at Accenture/Andersen Consulting, where I worked for a long time in imaging and document management. The things we’re up to are just “what’s next” for this industry. We really want to talk about this benchmark we just did with Amazon and DynamoDB. But besides that, I’m really looking forward to this conversation. Alan, I think we’ve known each other now for 12 years. So just having a good discussion with you.
Alan: We have known each other a long time and we have a shared background. As I think you know, when I left the world of oil and gas I worked for, I was going to say two years, maybe a little shorter, with Andersen Consulting on a crazy, what we would call a “digital transformation” project today, I guess. Shared background, lots of Documentum in there too. So yeah, let’s talk about this. You contacted me a while back and you said, “We’re thinking of doing this great big benchmark.” And I guess you thought I would be very excited, and I think I came back and said something along the lines of “Well, the last time people did these benchmarks they didn’t work out too well.” But we’ve been talking about this one, and this is not a billion files, this is 11 billion files, which is a phenomenal number. But, and I’m sure you can talk to this, Dave, there are people who have multiple billions of files on their systems. You know, the challenge of the original, just for people who weren’t around back then (why would you know this?), is that there were a number of billion-file benchmarks of ECM systems back in the day. But really all they proved was that you can stuff a billion files into a repository; you couldn’t actually do anything much with them once they got there. So how is this one different? Apart from it being 11 billion and not one billion?
Dave: So, we really started looking at this because we’ve had more and more clients that are starting to push that billion-document barrier. And to your point, as we’ve tried to actually do stuff (with billion-document repositories), we’re seeing the different issues of scaling to that kind of massive scale. For our side, we’ve wanted to bring up this idea of NoSQL as a better way to do document management in ECM. You know, we’ve been investing (for 3 years) in Hadoop, and now DynamoDB, as a way to prove out (the benefits of NoSQL). And one of the big benefits that a NoSQL approach gives us is this massive amount of scale and this massive ingestion. We’ve been talking about it for the last three years, and as we were brainstorming during our quarterly meeting about how to get past just the talk, we decided to leverage our relationship with Amazon and leverage what we’ve built in DynamoDB to show that we can scale this large repository, and we can do it very fast. So, some of the stats we have were 20,000 documents a second ingested; a billion documents a day; and we roughly completed the entire ingestion, the first phase of the benchmark, in a week. As you were coaching us on this, it was “hey, you’re going to learn from this, as well as what not to do.” Our learning is that it can really scale. We are prepped to take this to clients, and we’re even talking about using this as a test harness to test out our clients’ production installs for our non-DynamoDB clients as well. So, we’ve learned a ton from it. We continue to learn, and as we get into the next couple of phases we want to do even more.
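(A back-of-the-envelope check of the figures Dave quotes, offered as a rough sketch rather than anything taken from the benchmark itself. It assumes the 20,000 documents per second figure is a rate that could be sustained around the clock and that “a week” means seven full days; the function names are illustrative and are not part of any TSG tooling.)

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def docs_per_day(docs_per_second):
    """Documents ingested in one day at a constant per-second rate."""
    return docs_per_second * SECONDS_PER_DAY


def days_to_ingest(total_docs, docs_per_second):
    """Days needed to ingest total_docs at a constant per-second rate."""
    return total_docs / docs_per_day(docs_per_second)


def required_rate(total_docs, days):
    """Average documents per second needed to finish total_docs in `days` days."""
    return total_docs / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    total = 11_000_000_000  # the 11 billion files discussed in the interview
    peak = 20_000           # documents per second, as quoted (assumed sustained)

    print(f"Per day at peak rate: {docs_per_day(peak):,}")
    print(f"Days for 11 billion at peak rate: {days_to_ingest(total, peak):.2f}")
    print(f"Average rate to finish in 7 days: {required_rate(total, 7):,.0f} docs/sec")
```

Under those assumptions, 20,000 documents a second works out to roughly 1.7 billion a day, so “a billion documents a day” and “about a week” for 11 billion are consistent with an average rate somewhat below the quoted peak.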
Alan: Yeah, I mean, I watched that screen as the upload was ongoing. It was quite incredible, as you say, to see 10, 15, 20 thousand files move per second. I mean, that was pretty phenomenal. But there are two things here really. One, I guess, is that what is being proven out here is that there’s really no need any more for traditional repositories, particularly those at scale, with the use of NoSQL in the cloud. But second, and it comes back to my earlier thing, I think one of the things which is different about this benchmark is that you can actually do something with the files once they’re there. I mean, in the past the search engines couldn’t scale, right? They couldn’t scale to a billion. And so really all you did was store the files. So, do you think this can actually be a fully working system as opposed to just sort of glorified storage?
Dave: We tried to target some (real-world) case management for this first phase to actually show it working. So, we are showing it (working) for either a health care claim or an auto insurance claim. We are showing what our clients typically do in a case management scenario, and that isn’t a heavy searching interface. This next phase, we are adding a billion objects to the search index to prove out the accounts payable portion of it, where we do think there’s a scenario of “Show me all the vendor’s invoices over a certain amount across six years” that would command a search. So, we are going to prove that piece out (as well). But then, as you know, as we move forward the third and the fourth phases are adding documents and doing anything we would normally do. So what we’re trying to do is deploy that index where it makes sense for the scenario, like accounts payable, but not necessarily worry about the index for case management, where maybe it doesn’t make as much sense, as we’re just indexing the folders for that scenario.
Alan: All right. Okay. And so, what’s the timeline here? I mean, you’ve got the 11 billion files up there, and you’ve got them up there really quickly. I mean, you don’t have to give me specific dates for everything, but when do you think you’ll sort of complete this whole test?
Dave: So we’re hoping in the next two to three weeks we’ll be done. (Editor note – we did finish – see Summary of Posts) We just finished the indexing of a billion documents in Elasticsearch for the Accounts Payable scenario. We’re going to blog about that next week. And then Phase 3 will include adding documents, and Phase 4 is going to be something our clients struggle with, which is concurrent user access. We want to scale up to five to ten thousand concurrent users to hit that part as well. Phase 3 and 4 will happen very quickly. So, we’re hoping in the next two to three weeks we’ll wrap up the final phase.
Alan: I mean, I think that’s what’s remarkable here right. I mean, when you and I started out and let’s be honest even up to very recent years to build an ECM system at scale it took in some cases years. Right? I mean it certainly wasn’t weeks. So, this seems to be I think what’s exciting here is I’ve been talking at Deep Analysis about this concept of ECM 2.0. The focus has shifted from the repository to the services or the content services as our friends at Gartner would like to say. But, at the same time I think there’s a lot of buyers out there and I’m just interested in your thoughts on this. I mean we’ve done a lot on file migration talking to a company literally last week about this. There’s still a perception out there that migration is hard, is expensive. It’s really risky and getting them to move legacy systems is really hard. And I’ll preface this by saying I think in fairness a 20-year-old Documentum system it is hard to move, right? I don’t think it’s a DIY job. But do you think this will shift the needle a little bit, or at least give people new options for their legacy, or do you see this as something for them to build fresh on as they move forward.
Dave: I think you have two scenarios. I would say a 30-year-old FileNet system is even harder than a 20-year-old Documentum system. So, we definitely are seeing those types of legacy migrations. I think the other piece, though, that we really wanted to hit on is something you and I have talked about and a lot of folks are looking at: if I had to redo content services, and I had to redo it with something like Amazon in mind, how would I do it? And what we really wanted to put out there with what we’re doing with DynamoDB and Elasticsearch, along with the OpenContent and OpenAnnotate Management Suite, is, here’s how we would build it to be truly a no-code, cloud-first initiative that can take advantage of the services that Amazon provides.
(Dave Note: – In re-reading this, I realized I didn’t answer Alan’s question about migrations being hard. In our years of migrating clients, it is really hard and there are lots of posts here for folks looking for migration options – see our related post – Why there will never be an Easy button for Migrations)
Alan: Yeah, I mean, well, I do sort of know but I’m not going to say, but this kind of initiative, particularly all the stuff around AWS and, you know, them launching Textract and stuff. I mean it does feel a little bit. Certainly, valuable to one of the legacy ECM vendors. To me this would be giving me the heebie-jeebies, because I’d be thinking this is SharePoint all over again. Somebody is coming in with something much faster, much cheaper, much more accessible. Of course, SharePoint was actually quite complicated, as we all found out. But this has the potential to really shake up the legacy marketplace, I would think.
Dave: We still get a lot of our clients that are on-premise, with plans maybe in two years for the cloud, that kind of stuff. So, we wanted to put this out there as “hey, we can help you get there”, but with all we’re doing with NoSQL, we can do on-premise as well. So, we love what we can get from Amazon, but from talking to clients, the on-premise stuff isn’t going away. And we’re actually kind of excited about some of the things that we’re seeing with the Hitachi object store, which has been very successful at one of our clients on-premise. We want to continue to provide solutions both ways. But to do this at scale, we needed someone like Amazon that could allow us to scale up this quickly, and we documented it in our blog. The 96-CPU machine we used for this (is) not something our on-premise vendors could provide, but we could quickly do that with Amazon.
Alan: Yeah. And it might sound like an odd question, but with the customers you’re talking to, do they really care that it’s open source, or is this more just sort of cloud first? I mean, you know, they want to move to the cloud, they don’t know how to. And so they probably will, as you said, have a, I hate the word, but a hybrid approach, right? They’re not going to shift everything off premises immediately. But do you find the open source thing still rings a bell with people and matters to them, or is it really just, you know, it’s Amazon, it’s a trusted brand, or whatever? Because again, you know, we’re moving away from going to Documentum as-was, right, or FileNet as-was, and bringing in a big team of consultants and building something almost from scratch, if we’re being honest. This sounds more like a DIY approach to ECM.
Dave: Well I think that you know some of this stuff that nibbles away at the ECM suite. So, you mentioned Textract, is something we’ve invested in. Oh, now you have Elasticsearch, you don’t necessarily have to take the search from the ECM vendor. These (options) are just nibbling away at the kind of edges of the repository. We wanted to show NoSQL as kind of that core. Whether that’s in the cloud or on-premise. And I think, when we talk to our clients, the clients just want the solution to the issue and we’re kind of surprised it’s not as much of a I need to move to the cloud right away, that might be a CIO led initiative, but the business user who is worried that FileNet will break tomorrow, or that all the proprietary formats that are on there that I won’t be able to access later on. It is time to open up that messy can of migrating it. And we want to give them lots of alternatives in order to do that.
Alan: Really cool, and this takes me back. I mean, we were mentioning, you know, “back in the day” so to speak, and the reason for my initial caution with your benchmark was, I think I’ve told this story, but I got my fingers burnt. Or rather my client did. I won’t name the client, but at the time I believe that was one of, if not the biggest, Documentum deals. Well, it would be one of the biggest Documentum deals in their history. But the client had selected Documentum, and it wasn’t that we were against Documentum. I was at Wipro at the time. We were actually all for it. We had hundreds of people trained up on it. So, we were all for it. However, Documentum, literally, physically, could not do what the client wanted it to do. But it was that billion benchmark which had flipped the sale. So, the salespeople came in, they convinced them “we’re the most scalable. We’re the biggest and everything.” And they did sign the deal. It was eight figures, not seven. That project was a train wreck. Well, essentially it never succeeded. It was abandoned eventually. So, with that in mind, sort of just shifting a little bit, I mean that was fascinating for me to be there battling away against the Documentum salespeople, trying to get the client to make a different decision. It was interesting. I mean, the salespeople in the ECM world have been remarkably creative over the years. I dunno if you’ve had experiences like that. I mean, as I say, I’ve got to give credit to the salesperson who led that Documentum deal, but boy, they turned a fantastic deal and they basically sold something that did not work.
Dave: Yeah, and I’m often glad, Alan, that I’m not in software sales, that I’m still a tech arch at my core. You know, what we wanted to do with the benchmark was, if we’re gonna recommend this to clients, we want to say we’ve done more than you will ever have to do, and I think that’s the tough part. In that sale that you mentioned, “Can you scale this solution?” Yeah, I can, but there are so many variables, including the infrastructure from the client and how the data model’s set up and how many metadata items and all those other things that go into scaling large systems. We wanted to do something that was realistic, so that we could say we’ve done more than (a client would ever need). To your point, when you were coaching us that “you’re going to learn stuff as you go through this”, we really think we have. We’ve implemented changes in OpenMigrate (and) some of our data model evolved, as we blogged about: “should we do kind of the data model of the folder owning the content or the search owning the content”. We decided to do both, to give our clients that flexibility, because as a good architect, it’s that ability to design around what the client’s trying to do that we always kind of come up with. I don’t think a software salesperson always has those same levers to pull.
Alan: No, they don’t. I mean, also, to be fair and to sort of flip the table a little bit here, I think it’s a tough place to be a salesperson in ECM these days. I mean, these big deals are not really coming up anymore. The big deals of the past just aren’t there anymore. I mean, I’m sure you’ve seen deals over the past few years where, you know, there’ve been zeros missing off them compared to what they would have been 10 years earlier.
Dave: I think that’s the tough part about our space. As we talk about ECM 2.0, to your point, there were a lot of those train wreck type clients. We were not involved in many of them, but we all heard about them, in that “hey, I spent a lot of money and I didn’t get what I wanted” or “I spent a lot of money and I’m gonna continue to run this thing into the ground because of the money I spent on it.” We are seeing the FileNet implementations that have, as you know, a 1990s vintage on them that are still being used today, because they went through that pain and don’t necessarily see the need to do it (migrate) right away. We’re trying to make that easier for them with a lot of different things that we’re doing.
Alan: Easier for them, surely, in more than one way. I mean, again, you know, we keep saying FileNet and Documentum, but obviously there are lots of systems out there. So we’re not just picking on them, but on the ones who have the ECM suite. I mean, in all my years, I never really saw anybody, and I mean literally not one, who probably used even 10% of the capabilities of the suite. You know, in my experience, I don’t know about yours, people are looking for much simpler systems today. I mean, yes, more scalable, yes, cheaper, but they know what they need. And I don’t think they did back in the day.
Dave: I think that’s a good point, as far as, you know, it was sold as ECM. As enterprise: here is the one place you’re going to put all your content, kind of trying to leverage off that ERP thing that SAP was, and that idea that this one (ECM tool) can fit all your needs, but you only need to use 10% of it for this application. That’s a different type of sell. What we’ve tried to do is build from the ground up. How can I make that 10% the quickest, easiest architecture for just the 10% that you’re doing? And I think we both know Jeff Potts wrote an article that kind of related to this stuff: if you’re just building for this simple need, and we come across this with a lot of our insurance and claims clients, that’s a very focused set that doesn’t need everything in the (ECM) suite. How can we make that the most efficient, both from an implementation and a migration standpoint? We might not need all those other features that the (ECM) suite offers you.
Alan: Oh yeah. I mean, the (ECM) suites, you know, back in the day, you could have digital asset management, records management, web content management, document management; you could have everything all in one thing. But again, as I say, in my experience, people bought those suites but really only used one part of them. So, I think those buyers are just not going to do it again. Right? So again, much more volume is coming in. I think that’s the interesting thing. Eleven billion sounds like a huge number, but as you say, people are breaking that billion mark, multiple billions in some cases. If we project out five or 10 years’ time, well, I’ve never been a believer that everything should be in the cloud. I’m still not. But with these volumes that are coming at us, it’s hard to imagine that 10 years out there’s going to be very much left on premises.
Dave: Yeah, I would tend to agree. Like I said, I don’t think the on-premise vendors are going away. I mentioned that Hitachi object store. The fact is that it can store massive amounts of content on-premise, and it has the connectivity to push stuff to the cloud from the object store. It’s a really interesting model for getting clients both the security and performance of on-prem, yet the cost savings of the cloud. We just see that those (on-premise) vendors are going to react. It’s hard to anticipate just whether that’s the hyperconverged stuff or other stuff. We do see that we (TSG) need to play in both places for a lot longer.
Alan: Oh yeah. No, I think there are cases. There was a company I spoke to, I can’t remember when, earlier this year anyway, where I was telling them don’t do that in the cloud, because, you know, these were big fat files that were moving very fast, being pulled up very, very quickly, and I’m saying it’s just simple physics. You want that file close to that user. I think that’s where the cloud oversells itself sometimes. You know, there are times, in high transactional circumstances with big files, where the cloud is always going to come out second best.
Dave: Yeah. You know, we’re not a big fan of the hybrid cloud, at least from the way the software vendors are pushing it. But I always talk to people about video, as far as that ability to ingest external video with Amazon as a cloud and then either expose it internally, or stream it internally, or other stuff. There are use cases where the cloud makes perfect sense, where we are seeing clients mix and match a little bit of on-premise storage with cloud storage as well, depending on the use case.
Alan: Yeah. Absolutely, and you know, again, it’s “horses for courses” as we say in England, right? You’ve got to have the right fit. And I do agree with you that on-premises is not going away and the vendors aren’t. I just think for these large, large volumes, whether we like to call it hybrid or not, when something isn’t being actively used and it’s sort of past its prime time, but you don’t want to get rid of it, it doesn’t make a lot of sense these days to keep it, you know, nearby. I think they get shifted to the cloud, but those strategies are still being worked out. So, with that, we talked a lot about our ECM 2.0 thing, which is, what, over a year ago now; it’s been the topic of lots of discussion with lots of people. I think the one which still has me scratching my head, so I’m just throwing this one out to you unprepared: yes, machine learning and A.I., that’s starting to creep in now, again because of volumes, right, and people want things done faster and quicker. But the blockchain element is not really there yet with any of the ECM vendors, and yet I’m seeing lots of use cases come up that involve lots of files and do involve blockchain but don’t involve any ECM vendors. It just feels to me that they’re missing an opportunity here by not exploring it. So, I don’t know if you’ve got thoughts on that or not. It’s early days, don’t get me wrong, but to me, and I may be the lone voice in the ECM world, I still think blockchain holds a lot of promise with people. A shared version of the truth, rather than your version of the truth, if that makes sense.
Dave: We brought you in on one of our client opportunities around that auction site, where it (blockchain) really fit. This is where it fits and cost-justifies itself. I think that’s the biggest thing for machine learning, for blockchain, because our clients are very tactical. They will use stuff (technology) when it fits a scenario that cost-justifies itself. And I think that’s the piece: it’s not necessarily the ECM vendors, it’s the clients coming up with the scenario that fits blockchain and driving us (TSG) or the ECM vendors to it. We have, and I don’t want to drop client names on it, but we do have an insurance client that is making extensive use of blockchain in a marketplace for all their clients, and that’s been very, very successful. It’s not as much of a document play, but they have been very successful with it because they’ve invested and they have a model where it makes sense.
Alan: Yeah, I think you’re right. I’m not going to say I disagree; I actually do agree with you. But I do wonder sometimes with the ECM community, because it is a community where everybody knows everybody. You know, whenever you see a new senior hire at an ECM company, I guarantee you know who they are, because it’s just moving chairs really. But I’ve always been frustrated that ECM is still selling into, and you know, don’t get me wrong, in your case very successfully, right? So, you’ve got a really great footprint in insurance, for example, and other people have a great footprint in government, et cetera, et cetera. So that’s great. But when we look forward, I do feel as a community we don’t look for new opportunities, that we tend to focus in on the traditional customers. And that makes business sense, because they’re still paying and they still want to pay. But I do wonder why, if we look back on the incredible success of SharePoint, there were so many customers out there who had never even heard of an ECM system who were suddenly buying one. So, long story short, I’m a believer that ECM is not dead, is not kaput, is not whatever it was Gartner said. I think there are a lot of untapped market capabilities, or market opportunities I should say, out there. I think it’s got room to really grow still. I think the vendors are a little too blinkered and don’t look outside their traditional markets and boundaries.
Dave: Alan, we’ve always talked about where we’re a little different as a consulting and a software firm. Our clients get to vote with their dollars, and they take us where they want to go instead of us trying to guess, “build it and they will come.” I always tell my guys, “hey, wait a minute, we don’t necessarily want to do that.” We really pivoted from a lot of our life sciences background with Documentum (and we are doing some work with Veeva) to the insurance play, more with Alfresco, because our clients took us there, not because we sat back down and said “oh, here’s where we should go.” I think that’s what a lot of the vendors want to do: “hey, let’s go after this market because this market looks good.” But if the clients don’t want them there, it’s very hard. People find us and say “Hey, can you help us with this?” And we have seen that; that’s how we got into insurance and nonprofits and other technology. Those are folks (clients) that get what we can do, and we’re looking for those next people (clients). But it’s hard to just identify (and) say we can do more. To hit your point on the SharePoint piece, I would say, if you looked at Documentum and eRoom back in the day, you know, SharePoint, and later Box, or at the same time as Box, nibbled away at the ECM side of collaboration. They were very successful because it’s a collaboration framework, but we always say “collaboration, does that really lead to a record?” I think what SharePoint and Box have done is carve off the collaboration pieces. Our core clients aren’t looking for collaboration with our solutions; they’re looking for true document and records management type stuff.
Alan: So, coming to the end. Silly question, but I try to ask everybody. In 10 years’ time, we might not call it ECM anymore. Who knows? What does the industry look like in 10 years’ time?
Dave: I would say our view is it’s still going to be some of the similar applications. I do think our biggest push is to say that, because they’re just better, the NoSQL repositories will replace all of the underlying infrastructure pieces (SQL); things will be built on services that are more NoSQL. Big data-based approaches are the ones that we’re (TSG) placing our bets on. Now, I think the suite vendors are going to struggle to grasp the new model. I think it is going to be vendors like us or newer vendors. The suite vendors who have the business model of “we need to charge this, or we need to add this product onto it”, I think they are going to struggle with the whole innovator’s dilemma, in that they can’t pivot to the new model where clients want more services-based offerings and support and are not looking for a one-size-fits-all. I was just at a conference where the client was like, “Hey, we do want more best of breeds and we are not getting that from our suite vendors.” So, I think there are going to be a lot of new entrants. Are we going to see new content, this native-born content, whether that’s security videos or other things? We are going to be managing a lot more of that. So, I do think we are going to see more of the billion-object repositories, just because more and more content is going to be created without the manual intervention of a human having to create it, and we are going to have to manage it as a part of that process, or case.
Alan: Yeah, and no, I agree. And I think, you know, as an analyst, I’m seeing so many of those start-ups coming in who, you know, aren’t interested in the repository, they aren’t interested in traditional document management per se, but actually what they are doing is providing machine learning and A.I. automation for a lot of document management tasks. And I think there is going to be a lot more of that as we move forward, you know, whether that’s at the records management end or whether it’s at the capture end; so much of that could be automated. But very few people today have automated it.
Dave: Yeah, and I do think the one that we have invested a lot in, we do the privacy thing. We’ve played around with a lot of the different machine learning and redaction, and kind of tying those together. I do think the repository needs to be a little bit more cognizant. That personal information that is buried in that health claim, you do want to be a little bit more aware of it and manage it a little bit differently, a little bit tighter. Even down to the document level. To date, we have been providing those capabilities to folks, we are going to see more and more people doing it.