
Restrictions in the bwSFS (certain users)

For a few days now, certain user groups (primarily Galaxy and bwCloud VMs) have unfortunately been experiencing restrictions in the storage system for research data (bwSFS). Initial analyses point to severe delays in the tiering layer of the Hierarchical Storage Management. Investigations are still ongoing.

For about 14 days there have been major performance problems when transferring data from the S3 tier (FabricPool) back to the fast NVMe-based storage areas of our AFF A400 system. Depending on the use case, this has caused significant limitations in the storage system for research data (bwSFS) for Galaxy and for VMs in the bwCloud, since loading older files is subject to severe delays. The most likely cause is the tiering of the Hierarchical Storage Management (into the S3 backend of bwSFS). Services such as bwCloud are not directly affected, as they do not use this feature of the storage backend. Likewise, S3 in direct use does not appear to be affected. The analyses were carried out as a team effort by many stakeholders, including the network group. Further steps, potentially involving the vendor, will be taken starting Monday.
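For affected users, one rough way to check whether a given file is hitting the slow tiering path is to time a cold sequential read over NFS and compare a recently written file against an older, presumably tiered one. The sketch below is a minimal illustration of that idea; the mount paths are placeholders, not actual bwSFS paths, and only the first (cold) read of each file is meaningful, since repeat reads will be served from the client's page cache.

```python
#!/usr/bin/env python3
"""Rough check: compare NFS read throughput of a recent (hot) file
against an older file that has likely been tiered to the S3 backend.
Paths are placeholders -- substitute files on your own bwSFS mount.
Only the first (cold) read is meaningful; repeats hit the page cache."""

import time

CHUNK = 4 * 1024 * 1024  # read in 4 MiB chunks

def read_throughput(path: str) -> float:
    """Return sequential read throughput in MB/s for the given file."""
    total = 0
    start = time.monotonic()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    elapsed = time.monotonic() - start
    return total / elapsed / 1e6

for label, path in [("recent file", "/mnt/bwsfs/new_data.bin"),
                    ("older file", "/mnt/bwsfs/archive/old_data.bin")]:
    print(f"{label}: {read_throughput(path):.1f} MB/s")
```

A large gap between the two readings (fast recent file, very slow older one) is consistent with the tiering-retrieval problem described above rather than a general NFS or network issue.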

Update (5/27, 12 p.m.):

Tiering problems could also be triggered by other effects, such as the integration of a new StorageGRID node (a configuration issue) or the recovery of another one. Inquiries with the vendor are still being pursued in parallel. The effects in the bwCloud should be much smaller in most cases, since old blocks are rarely actually requested there.
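Since direct S3 use appears unaffected while the tiering path is slow, one parallel check is to time a GET straight against the object-store endpoint, bypassing FabricPool entirely. Below is a minimal sketch using boto3; the endpoint URL, bucket, and key are placeholders, not the actual StorageGRID configuration, and credentials are assumed to come from the usual AWS environment or config files.

```python
#!/usr/bin/env python3
"""Time a direct S3 GET against the object-store endpoint to separate
grid-side problems from FabricPool retrieval problems.
Endpoint, bucket, and key below are placeholders; credentials are
taken from the standard AWS environment/config chain."""

import time
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.org",  # placeholder object-store endpoint
)

start = time.monotonic()
obj = s3.get_object(Bucket="test-bucket", Key="probe-object")
data = obj["Body"].read()
elapsed = time.monotonic() - start

print(f"fetched {len(data)} bytes in {elapsed:.2f} s "
      f"({len(data) / elapsed / 1e6:.1f} MB/s)")
```

If such direct GETs are fast while recalls through the tiering layer remain slow, that points at the FabricPool retrieval path rather than the object store itself.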

Update (5/31, 9 a.m.):

Transfer rates for the S3 tier are currently 1-3 Mb/s, with no particular load on the storage system or network evident in monitoring. In this setup, which has been in productive use since September 2021, around 100-200 client VMs access the NFS storage provided by the A400 in parallel, around the clock. Until about 14 days ago this mostly worked without problems.
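To put the measured rate in perspective, here is a back-of-the-envelope calculation; the 10 GB dataset size is hypothetical, and the figure is taken at face value as roughly 2 megabytes per second (the midpoint of the observed range, assuming bytes rather than bits):

```python
# Rough recall-time estimate at the currently observed transfer rate.
rate_mb_s = 2        # midpoint of the observed 1-3 range, assumed MB/s
dataset_gb = 10      # hypothetical size of a tiered dataset

seconds = dataset_gb * 1000 / rate_mb_s
print(f"~{seconds / 60:.0f} minutes to recall {dataset_gb} GB")  # ~83 minutes
```

At these rates, recalling even a modest tiered dataset takes on the order of hours, which matches the severe delays users are reporting when loading older files.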

We have been investigating the cause intensively since the problems first occurred, but have not yet been able to identify it.