
How Large Language Models Impact Data Storage

Author: AIVON | February 06, 2026


 

Overview

In the era of intelligent computing, compute is productivity and data is a core production factor. The emergence of large language models places higher demands on data storage.

 

AI storage products for the large-model era

Huawei recently announced new storage products aimed at large models. The OceanStor A310 deep learning data-lake storage and the FusionCube A3000 training/inference hyperconverged appliance target foundation-model training, industry-model training, and scenario-specific model training and inference, providing storage solutions for these workloads.

OceanStor A310 deep learning data-lake storage is designed for foundation- and industry-scale data-lake scenarios. It supports the full AI workflow from data aggregation and preprocessing to model training and inference application. A single 5U chassis supports up to 400 GB/s bandwidth and up to 12 million IOPS, and the system can scale linearly to 4,096 nodes with multi-protocol, lossless interoperability. The global file system (GFS) enables cross-region intelligent data stitching to simplify data aggregation. Near-data compute supports preprocessing close to the stored data, reducing data movement and improving preprocessing efficiency by about 30%.
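
As a rough illustration of what linear scaling implies for aggregate performance, the sketch below multiplies the per-chassis figures quoted above by a node count. Treating each node as one 5U chassis and assuming perfectly linear scaling are simplifications made for illustration, not claims from the source.

```python
# Illustrative only: estimates aggregate performance under the linear-scaling
# description above. Assumes each node contributes the full per-5U figures,
# which is a simplification of the vendor's description.

PER_NODE_BANDWIDTH_GBPS = 400       # GB/s per 5U system (figure quoted above)
PER_NODE_IOPS = 12_000_000          # IOPS per 5U system (figure quoted above)
MAX_NODES = 4096                    # maximum cluster size (figure quoted above)

def aggregate_performance(nodes: int) -> tuple[float, float]:
    """Return (bandwidth in GB/s, IOPS) assuming ideal linear scaling."""
    if not 1 <= nodes <= MAX_NODES:
        raise ValueError(f"nodes must be between 1 and {MAX_NODES}")
    return nodes * PER_NODE_BANDWIDTH_GBPS, nodes * PER_NODE_IOPS

if __name__ == "__main__":
    for n in (1, 16, 256, MAX_NODES):
        bw, iops = aggregate_performance(n)
        print(f"{n:>5} nodes: ~{bw:,.0f} GB/s, ~{iops:,.0f} IOPS (ideal)")
```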

FusionCube A3000 training/inference hyperconverged appliance is intended for industry-scale model training and inference. For applications at the hundred-billion-parameter scale, the appliance integrates OceanStor A300 high-performance storage nodes, training/inference nodes, switching equipment, AI platform software, and management and maintenance software to simplify deployment and delivery. The vendor states the appliance can be deployed within two hours.

The appliance supports two commercial deployment models, including a Huawei Ascend one-stop solution and an open model that allows third-party partners to provide compute, networking, and AI platform software. Training/inference nodes and storage nodes can be scaled independently to match different model scales. High-performance containers enable multiple training and inference tasks to share GPU resources, with reported GPU utilization improvements from around 40% to over 70%.
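
To put the reported utilization gain in perspective, the following back-of-the-envelope sketch converts average utilization into effective GPU-hours. The cluster size and time window are assumed values chosen for illustration; only the 40% and 70% figures come from the text above.

```python
# Illustrative arithmetic only: converts average GPU utilization into effective
# GPU-hours. Cluster size and time window are assumptions, not article figures.

GPUS = 64                 # assumed cluster size
HOURS_PER_MONTH = 720     # assumed window: 30 days

def effective_gpu_hours(utilization: float) -> float:
    """Effective GPU-hours delivered at a given average utilization (0.0-1.0)."""
    return GPUS * HOURS_PER_MONTH * utilization

before = effective_gpu_hours(0.40)   # ~40% utilization without sharing
after = effective_gpu_hours(0.70)    # ~70% utilization with container-based sharing

print(f"Before: {before:,.0f} effective GPU-hours per month")
print(f"After:  {after:,.0f} effective GPU-hours per month")
print(f"Gain:   {after / before - 1:.0%} more useful compute from the same hardware")
```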

 

Policy context in China

The success of large language models such as ChatGPT has highlighted the long-term investments required for training large-scale models. At the policy level in China, several national and municipal initiatives have been issued to guide and support artificial intelligence development.

Examples of recent municipal actions in China include:
• Beijing's implementation plan for building a globally influential AI innovation hub and measures to promote general AI innovation (issued May 30)
• Shenzhen's action plan to accelerate high-quality AI development and high-level application (issued May 31)
• Chengdu's draft measures to further promote high-quality AI industry development (issued June 5)
• Hangzhou's draft implementation opinions to accelerate AI industry innovation (issued June 12)
• Wuxi's three-year action plan for AI industry innovation development (2023–2025, issued June 14)
• Shanghai's measures to promote innovation in AI large models (issued July 8)
• Chongqing's scenario-driven action plan for high-quality AI industry development (2023–2025, issued July 25)

During the 2023 National People's Congress and Chinese People's Political Consultative Conference sessions, some delegates and committee members drew public attention to developing China's own ChatGPT-style systems. For example, Liu Qingfeng, a deputy to the National People's Congress and chairman of iFlytek, advocated accelerating the construction of cognitive-intelligence large models on autonomous and controllable platforms so that industries can benefit from AI. Another representative, Qian Jiasheng, recommended strengthening interdisciplinary AI science and technology education and advancing an "AI + discipline cluster" training model to build talent and innovation systems in the AI field.

 

Four major challenges for large-model deployment

1. Long data preparation time: data sources are dispersed, aggregation is slow, and preprocessing several hundred terabytes can take around 10 days.

2. Slow loading of massive small files: multimodal large models rely on massive text and image datasets, but current loading speeds for many small files are below 100 MB/s, reducing training data loading efficiency.

3. Frequent parameter tuning and unstable training: training is interrupted on average about once every two days, each interruption requires resuming from a checkpoint, and failure recovery can take more than a day (see the checkpointing sketch after this list).

4. High implementation barriers: complex system construction, difficult resource scheduling, and typically low GPU utilization, often below 40%.
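
Challenge 3 turns on how quickly training state can be saved and restored. The sketch below shows a minimal checkpoint save/resume loop in PyTorch; the model, optimizer, training loop, and checkpoint path are hypothetical placeholders, not details from the source.

```python
# A minimal checkpointing sketch in PyTorch, illustrating challenge 3 above.
# Model, optimizer, training loop, and checkpoint path are placeholders.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"   # assumed location; in practice this sits on shared storage

model = nn.Linear(1024, 1024)                       # stand-in for a large model
optimizer = torch.optim.AdamW(model.parameters())   # stand-in optimizer

def save_checkpoint(epoch: int) -> None:
    """Persist everything needed to resume after an interruption."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        CKPT_PATH,
    )

def load_checkpoint() -> int:
    """Restore state if a checkpoint exists; return the epoch to resume from."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

start_epoch = load_checkpoint()
for epoch in range(start_epoch, 10):
    # ... one epoch of training on the real workload would go here ...
    save_checkpoint(epoch)   # frequent saves keep the amount of lost work small
```

In practice, how often checkpoints are written, and how quickly the underlying storage can absorb and serve them, largely determines how much work is lost per interruption and how long recovery takes.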

 

Current storage requirements for large models

Large models are currently dominated by text-only unimodal applications, but as models are integrated with industry scenarios in China, many domestic large models have indicated plans to accelerate multimodal development. Data types will expand to include images, audio, and video, increasing storage demands.

Storage requirements will grow in two ways: training on massive, heterogeneous data and supporting data applications across a large number of endpoints. Insufficient storage capacity can affect model performance. Expanding from text to images, audio, and video will substantially increase data volume, with expected growth from several terabytes (1 TB = 1,024 GB) in text-only datasets toward petabyte-scale multimodal datasets (1 PB = 1,024 TB). This trend places higher demands on storage architecture and performance.
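
As a rough illustration of why the shift to multimodal data pushes datasets from terabytes toward petabytes, the sketch below totals an assumed corpus. Every item count and average file size here is an illustrative assumption, not a figure from the source.

```python
# Rough, illustrative sizing only: item counts and average sizes are assumptions
# chosen to show the order-of-magnitude jump from text-only to multimodal data.

TB = 1024 ** 4          # bytes per TB (binary, matching 1 TB = 1,024 GB above)
PB = 1024 ** 5          # bytes per PB (1 PB = 1,024 TB)

text_bytes  = 5e9  * 2_000          # assumed: 5 billion documents x ~2 KB each
image_bytes = 2e9  * 500_000        # assumed: 2 billion images x ~500 KB each
video_bytes = 50e6 * 200_000_000    # assumed: 50 million clips x ~200 MB each

print(f"Text only:             {text_bytes / TB:8.1f} TB")
print(f"Text + images:         {(text_bytes + image_bytes) / PB:8.2f} PB")
print(f"Text + images + video: {(text_bytes + image_bytes + video_bytes) / PB:8.2f} PB")
```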

Industry estimates project that by 2026 the AI software and applications market in China could reach about $21.1 billion, and Chinese technology companies are exploring new methods and models for running large models. However, practical deployment is necessary to realize the value of large models.

 

Conclusion

Data, algorithms, and compute are the driving forces behind AI development. Large models have strengthened the generality of AI technologies and advanced AI implementation. Going forward, deep integration of large models with application scenarios, supported by professional tools and platforms, and an open ecosystem can help translate models into deployed applications. Providing tools and methods that support the full deployment lifecycle can enable more organizations to benefit.

Sources referenced: Huawei: Huawei releases AI storage products for the large-model era; Pu Yin International Research: Storage trends for the large-model era based on Huawei AI storage products; China News Service: Large models drive diverse data processing and pose new storage requirements; Medical Technology Summit Forum: Depth and speed of large models.

