The hidden data costs threatening enterprise AI plans
The AI revolution may be underway, but it comes with significant costs. While the headline prices of cloud services are generally well understood, the more granular costs often escape attention.
Chalan Aras, SVP and GM of Acceleration at Riverbed, says many organisations don't realise how much data they store in the cloud. While on-premises storage is easily measured, the disparate and open-ended nature of cloud storage means organisations typically hold far more data than they think. And that footprint proliferates as individual departments procure their own services.
In addition to storage costs, significant charges can be incurred when data needs to be moved, whether between cloud providers or between regions of the same provider.
"I was recently working with a client in financial services that needed to move 30 petabytes of data between two cloud providers," explains Chalan. "The cost of egress can be a shocker. I've seen cost of a petabyte of movement reach up to $90,000."
Chalan also points to the hidden cost of what Riverbed calls the 'double bubble'. When data for AI projects is moved for training or inference, there are GPUs at both the origin and the destination. Those GPUs are expensive, and contracts typically commit the customer to using them. While the data is in transit, which is itself costly, the company is stuck paying for GPUs at both ends.
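To put those two cost drivers in perspective, the back-of-the-envelope sketch below combines the egress figure quoted above with assumed GPU commitment rates. The GPU price, fleet size, and overlap period are illustrative assumptions, not figures from Riverbed.

```python
# Rough back-of-the-envelope estimate of cloud-to-cloud migration costs.
# The egress rate reflects the figure quoted above (up to ~$90,000 per PB);
# the GPU figures are illustrative assumptions, not quoted prices.

def egress_cost(volume_pb: float, usd_per_pb: float = 90_000) -> float:
    """Egress charges for moving `volume_pb` petabytes between providers."""
    return volume_pb * usd_per_pb

def double_bubble_cost(gpus_per_site: int, usd_per_gpu_hour: float,
                       overlap_days: float) -> float:
    """Cost of committed GPUs sitting at BOTH origin and destination
    while the data is still in flight (the 'double bubble')."""
    hours = overlap_days * 24
    return 2 * gpus_per_site * usd_per_gpu_hour * hours

if __name__ == "__main__":
    # 30 PB move, as in the financial-services example above.
    print(f"Egress:        ${egress_cost(30):,.0f}")            # ~$2,700,000
    # Assumed: 64 committed GPUs per site at $3/hr, 45 days of overlap.
    print(f"Double bubble: ${double_bubble_cost(64, 3.0, 45):,.0f}")
```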
Another challenge when accessing large volumes of data for AI initiatives is network architecture and topology. While urban areas may have access to 100Gbps backbones, network overhead and other constraints may limit movement to just 20TB of data per hour. Chalan says this can lead to unacceptable delays.
"For many organisations, the most data they can move in a day is about 3TB. If you need to move 20PB, something we commonly see in many industries today, it might take 20 days for just 1PB. With some of the larger organisations Riverbed is working with, we see volumes of data that can take up to nine months to move. That's an unacceptable delay," says Chalan.
That becomes even more complex in multi-cloud environments. Chalan says it's relatively straightforward when all the data is in one place, but when it's distributed the challenges multiply.
Moving large volumes of data quickly often looks like a problem that can be solved by simply adding bandwidth. But Chalan says this is not always the best solution.
"If you're within the same nearby region, more bandwidth can help. But if you're not that close by, more bandwidth doesn't help at all. In fact, it may make the problem worse because we have more traffic and then latency starts overwhelming the bandwidth. You must ensure every part of the system is built for that performance. Some of the latency might be because of storage or the cloud that you're using may throttle your activity because you did not specify the right amount of throughput expected or you're not using the right set of instances," Chalan explains.
The complexity of AI applications means performance must be measured end to end. Bandwidth matters, but on its own it's the equivalent of a dial tone on a telephone: the goal is to make the call, and to complete the request or transaction in a reasonable time.
"The solution should focus on that on the outcomes that that matter to the enterprise and not the raw data underneath. Design choices must translate to tangible outcomes," says Chalan.
Chalan says there are some best-practice guidelines organisations can adopt when designing data pipelines for AI projects. A key consideration is understanding whether the data architecture needs to support a long-term initiative or a short-term data migration. And before committing to a large-scale infrastructure project, look at how you can optimise the network you already have.
"Don't try to change your architecture too much," Chalan says. "Make things work. That's where, for example, our Data Express product can help accelerate those first steps."
That can give architecture teams valuable insights into what they need and what already works. Armed with that knowledge, organisations can consider changes to the network, storage, visibility, and other parts of the chain to optimise systems to enable their AI projects to operate at peak performance.
When it comes to network topologies, Chalan suggests not making too many big bets early in the journey. For example, he recommends connecting AI tools to data sources over VPNs as a low-cost option for proof of concept, before choosing a more costly or complex solution such as a direct connection or interconnect. Virtual appliances are also worth considering in his view, as they provide an entry point that can later be optimised with physical appliances.
The trick, in Chalan's view, is not to overcommit until you have a good understanding of your needs and capabilities. Focus on the outcomes your organisation requires and consider every element of the platform, so that performance doesn't end up bottlenecked by a component that can't be remedied.