s3 crawler does not gracefully handle completely removed keys in a versioned bucket #73

@yarikoptic

The use case is:

(git)smaug:/mnt/datasets/datalad/crawl/adhd200/RawDataBIDS/WashU[master]git
$> datalad ls -aL 's3://fcp-indi/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json'
Connecting to bucket: fcp-indi
[INFO   ] S3 session: Connecting to the bucket fcp-indi with authentication
Bucket info:
  Versioning: S3ResponseError: 403 Forbidden
     Website: S3ResponseError: 403 Forbidden
         ACL: S3ResponseError: 403 Forbidden
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2020-02-12T21:29:03.000Z DeleteMarker
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2020-02-12T21:26:29.000Z  996 ver:qbldQOJmB_DYp40eh3AeJlrgfZa281N2  acl:S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json</Key><RequestId>AE14DA04D623BAC5</RequestId><HostId>jK7XM5sEAeaZz2bd1OCXVoMMb/0C46iKyzL4VpGsB7rcypyvU+CeaHKH3I5uh4LX84GNm8UZdZQ=</HostId></Error>  http://fcp-indi.s3.amazonaws.com/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json?versionId=qbldQOJmB_DYp40eh3AeJlrgfZa281N2 [E: 403]
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2020-02-12T15:05:14.000Z 1643 ver:_PImN3HbTRK9vXnFRgUg6Kq4HmF5Z7r.  acl:S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json</Key><RequestId>AA76EA65F16A68E5</RequestId><HostId>JIf20KPlc5LwBbAy9cL40Z4yCQ+jDEGMUXMM1wouKw6t5uKqGvrwyhiIPsT9PUTPjLeQvyBdeZw=</HostId></Error>  http://fcp-indi.s3.amazonaws.com/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json?versionId=_PImN3HbTRK9vXnFRgUg6Kq4HmF5Z7r. [E: 403]
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2017-02-07T23:11:18.000Z 1407 ver:5110B.k9Fdmo3CErwRlDd4oqKL.Pf5Vp  acl:S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json</Key><RequestId>C8EF28ACF8E83E31</RequestId><HostId>kXAKuNSOZCcCDMONQuW7LKrEuWX2WHm9TM8A9aigOzQpYkSoZass1YWkiR+IF7uk36n+h2WwcoI=</HostId></Error>  http://fcp-indi.s3.amazonaws.com/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json?versionId=5110B.k9Fdmo3CErwRlDd4oqKL.Pf5Vp [OK]
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2017-02-07T23:11:08.000Z DeleteMarker
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2017-01-11T22:41:38.000Z 1407 ver:1NKbyRk7K0A4DoCrsbmcdDwO_yb7NOvs  acl:S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json</Key><RequestId>59732ACF13CC104D</RequestId><HostId>5UPTjFRK9XfhWz3cECx7NUBVV4RvVC+7tlzNpdFFXEbojncvygFzeE7gmtmbDWOP0nYvqQVs5VA=</HostId></Error>  http://fcp-indi.s3.amazonaws.com/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json?versionId=1NKbyRk7K0A4DoCrsbmcdDwO_yb7NOvs [OK]

At the moment there is an option to skip all problematic files entirely, but we do not want to apply the same rule to both completely removed keys and permission-denied ones (the latter might still be fixed). So we need a more specific option that tells the crawler which kinds of errors to skip and which should still cause a failure.
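
A rough sketch of what such a more specific rule could look like, assuming the crawler has the XML error body returned by S3 at hand. The names `s3_error_code`, `decide`, and `skip_codes` are hypothetical and are not existing datalad-crawler options; `NoSuchKey` is the code shown in the output above, and `AccessDenied` is the code S3 uses for permission errors.

```python
import xml.etree.ElementTree as ET

# Hypothetical outcome names, used only for this illustration.
SKIP, FAIL = "skip", "fail"


def s3_error_code(body):
    """Return the text of <Code> from an S3 XML error body, or None."""
    if isinstance(body, str):
        body = body.encode("utf-8")
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Empty or non-XML body: we cannot tell what went wrong.
        return None
    return root.findtext("Code")


def decide(body, skip_codes=("NoSuchKey",)):
    """Skip only errors whose S3 code is explicitly listed; fail otherwise.

    Anything not listed (e.g. AccessDenied, or an unparsable body) still
    fails, since such a problem might be fixable on the bucket side.
    """
    if s3_error_code(body) in skip_codes:
        return SKIP
    return FAIL


removed = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<Error><Code>NoSuchKey</Code>"
    "<Message>The specified key does not exist.</Message></Error>"
)
denied = "<Error><Code>AccessDenied</Code><Message>Access Denied</Message></Error>"

print(decide(removed))  # a completely removed key
print(decide(denied))   # a permission problem
```

Defaulting to failure keeps the current strict behavior for anything unexpected; only the codes listed in `skip_codes` would be silently skipped.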
