Details
Type: Bug
Status: Open
Priority: Blocker
Resolution: Unresolved
0.17
None
Description
I am running into an issue where Azure Storage returns error 503 (Server Unavailable), surfaced as Microsoft.WindowsAzure.Storage.StorageException, when I run a job that downloads data partitions from Azure Storage to 80 or more evaluators. The error does not occur with 64 evaluators. The full stack trace is below.
Org.Apache.REEF.IMRU.OnREEF.Driver.IMRUDriver`4[[Microsoft.MachineLearning.Distributed.Core.Trainers.KMeans.InputOutput.KMeansInputOutput, Microsoft.MachineLearning.Distributed.Core, Version=0.3.0.0, Culture=neutral, PublicKeyToken=null],[Microsoft.MachineLearning.Distributed.Core.Trainers.KMeans.InputOutput.KMeansInputOutput, Microsoft.MachineLearning.Distributed.Core, Version=0.3.0.0, Culture=neutral, PublicKeyToken=null],[Microsoft.MachineLearning.Runtime.IPredictor, Microsoft.MachineLearning.Core, Version=3.9.290.3615, Culture=neutral, PublicKeyToken=d353f9ba84f0e281],[Microsoft.MachineLearning.Distributed.Core.Common.IPipeline, Microsoft.MachineLearning.Distributed.Core, Version=0.3.0.0, Culture=neutral, PublicKeyToken=null]] Warning: 0 : 2018-05-11T00:59:28.4674513+00:00 0031 : WARNING: Received IFailedEvaluator bf0bcb92-5773-448d-bffa-6c478b619beb from endpoint unknown_endpoint with systemState WaitingForEvaluator in retry# 0 with Exception: Org.Apache.REEF.Driver.Evaluator.EvaluatorException: One or more errors occurred. ---> System.AggregateException: One or more errors occurred. ---> Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (503) Server Unavailable. ---> System.Net.WebException: The remote server returned an error: (503) Server Unavailable.
at Microsoft.WindowsAzure.Storage.Shared.Protocol.HttpResponseParsers.ProcessExpectedStatusCodeNoException[T](HttpStatusCode expectedStatusCode, HttpStatusCode actualStatusCode, T retVal, StorageCommandBase`1 cmd, Exception ex)
at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.<>c__DisplayClass1e.<GetBlobImpl>b__1b(RESTCommand`1 cmd, HttpWebResponse resp, Exception ex, OperationContext ctx)
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndGetResponse[T](IAsyncResult getResponseResult)
--- End of inner exception stack trace ---
at Microsoft.WindowsAzure.Storage.Core.Util.StorageAsyncResult`1.End()
at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass4.<CreateCallbackVoid>b__3(IAsyncResult ar)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at Org.Apache.REEF.IO.FileSystem.AzureBlob.AzureCloudBlockBlob.DownloadToFile(String path, FileMode mode)
at Org.Apache.REEF.IO.PartitionedData.FileSystem.FileSystemInputPartition`1.Download()
at Org.Apache.REEF.IO.PartitionedData.FileSystem.FileSystemInputPartition`1.Cache()
at Org.Apache.REEF.IMRU.OnREEF.Driver.DataLoadingContext`1.OnNext(IContextStart value)
at Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextLifeCycle.Start()
at Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime..ctor(IInjector serviceInjector, IConfiguration contextConfiguration, Optional`1 parentContext)
--- End of inner exception stack trace ---.