@FlipperPA (Contributor) commented Mar 17, 2025

When converting a CSV to STATA, I received the following error: src/bin/read_csv/mod_dta.c:404 unsupported variable type 3

This PR adds support for DATE_TIMEs to the STATA writer. Here's the test CSV that was used:

orgpermid,valuecalcdt,justdate
4295533401,2025-03-01 00:00:00,2024-03-01
4295533401,2023-08-26 00:00:00,2024-08-26
4295533401,1974-02-08 00:00:00,2024-02-08
4295533401,1972-06-29 23:59:59.123456+05,2024-02-21
4295533401,1854-03-01 00:00:00,2024-01-01
4295533401,1974-02-08 11:38:23.543212-02,2024-02-08
4295533401,,2024-02-21
4295533401,1972-06-29 23:59:59,2024-02-21
4295533401,1854-03-01 10:42:42,2024-01-01
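
For readers unfamiliar with the target representation: Stata's `%tc` values count milliseconds since 01jan1960 00:00:00.000. The following is a minimal sketch of how a CSV value like the ones above could be converted, assuming the input is cut to its first 23 characters so that microseconds and timezone offsets are dropped. The helper names `days_from_civil` and `stata_tc_from_string` are hypothetical and are not taken from this PR or from ReadStat; the actual implementation may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Days from 1970-01-01 to y-m-d in the proleptic Gregorian calendar. */
static int64_t days_from_civil(int64_t y, int m, int d)
{
    y -= (m <= 2);
    int64_t era = (y >= 0 ? y : y - 399) / 400;
    int64_t yoe = y - era * 400;                                  /* [0, 399] */
    int64_t doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; /* [0, 365] */
    int64_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

/* Hypothetical helper (an assumption, not the PR's code): parses
 * "yyyy-mm-dd hh:MM:SS" with optional fractional seconds and stores
 * milliseconds since 1960-01-01 00:00:00 in *out_ms.
 * Returns 0 on success, -1 if the string does not match. */
static int stata_tc_from_string(const char *s, int64_t *out_ms)
{
    char buf[24];
    int year, month, day, hour, minute, second;
    int millis = 0, digits = 0, n = 0;
    size_t len;

    assert(s != NULL && out_ms != NULL);

    /* Keep at most 23 characters: "yyyy-mm-dd hh:MM:SS.sss". */
    len = strlen(s);
    if (len > 23) len = 23;
    memcpy(buf, s, len);
    buf[len] = '\0';

    if (sscanf(buf, "%4d-%2d-%2d %2d:%2d:%2d%n",
               &year, &month, &day, &hour, &minute, &second, &n) != 6)
        return -1;

    /* Coarse range checks only; days per month are not validated. */
    if (month < 1 || month > 12 || day < 1 || day > 31 ||
        hour < 0 || hour > 23 || minute < 0 || minute > 59 ||
        second < 0 || second > 59)
        return -1;

    /* Optional fraction: read up to three digits, pad to milliseconds.
     * Anything else after the seconds field is ignored in this sketch. */
    if (buf[n] == '.') {
        const char *p = buf + n + 1;
        while (digits < 3 && *p >= '0' && *p <= '9') {
            millis = millis * 10 + (*p++ - '0');
            digits++;
        }
        for (; digits < 3; digits++) millis *= 10;
    }

    /* 1960-01-01 is 3653 days before 1970-01-01. */
    *out_ms = (days_from_civil(year, month, day) + 3653) * INT64_C(86400000)
            + hour * INT64_C(3600000) + minute * INT64_C(60000)
            + second * INT64_C(1000) + millis;
    return 0;
}
```

In this sketch a blank field fails to parse and returns -1, so the caller would have to decide separately to write a missing value for it. Dates before 1960 produce negative millisecond counts.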

Here's the JSON mapping file:

{
    "type": "STATA",
    "variables": [
        {
            "type": "NUMERIC",
            "name": "orgpermid",
            "label": "OrgPermID (orgpermid)",
            "format": "UNSPECIFIED"
        },
        {
            "type": "NUMERIC",
            "name": "valuecalcdt",
            "label": "ValueCalcDt (valuecalcdt)",
            "format": "DATE_TIME"
        },
        {
            "type": "NUMERIC",
            "name": "justdate",
            "label": "JustDate (justdate)",
            "format": "DATE"
        }
    ]
}

...and the output in STATA from the written file:

. list

     +-------------------------------------------+
     | orgper~d          valuecalcdt    justdate |
     |-------------------------------------------|
  1. | 4.30e+09   01mar2025 00:00:00   01mar2024 |
  2. | 4.30e+09   26aug2023 00:00:00   26aug2024 |
  3. | 4.30e+09   08feb1974 00:00:00   08feb2024 |
  4. | 4.30e+09   29jun1972 23:59:59   21feb2024 |
  5. | 4.30e+09   01mar1854 00:00:00   01jan2024 |
     |-------------------------------------------|
  6. | 4.30e+09   08feb1974 11:38:23   08feb2024 |
  7. | 4.30e+09                    .   21feb2024 |
  8. | 4.30e+09   29jun1972 23:59:59   21feb2024 |
  9. | 4.30e+09   01mar1854 10:42:42   01jan2024 |
     +-------------------------------------------+

...and to validate millisecond display:

. format valuecalcdt %tCHH:MM:SS.sss

. list

     +-------------------------------------+
     | orgper~d    valuecalcdt    justdate |
     |-------------------------------------|
  1. | 4.30e+09   00:00:00.000   01mar2024 |
  2. | 4.30e+09   00:00:00.000   26aug2024 |
  3. | 4.30e+09   00:00:00.000   08feb2024 |
  4. | 4.30e+09   23:59:59.123   21feb2024 |
  5. | 4.30e+09   00:00:00.000   01jan2024 |
     |-------------------------------------|
  6. | 4.30e+09   11:38:23.543   08feb2024 |
  7. | 4.30e+09              .   21feb2024 |
  8. | 4.30e+09   23:59:59.000   21feb2024 |
  9. | 4.30e+09   10:42:42.000   01jan2024 |
     +-------------------------------------+
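
As a cross-check on the listing above, the time-of-day portion of a millisecond count can be recovered with a floored modulo by the number of milliseconds per day. This is a minimal sketch, assuming the stored value is a plain count of milliseconds since 01jan1960 with no leap-second adjustment. The names `tc_time_of_day`, `split_time_of_day`, and `format_time_of_day` are hypothetical and do not come from Stata or ReadStat.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the clock fields of a date-time value. */
typedef struct { int hour, minute, second, millis; } tc_time_of_day;

/* Splits a millisecond count into clock fields. C's % truncates toward
 * zero, so negative (pre-1960) values are shifted into [0, ms_per_day). */
static tc_time_of_day split_time_of_day(int64_t tc_ms)
{
    const int64_t ms_per_day = INT64_C(86400000);
    tc_time_of_day t;
    int64_t tod = tc_ms % ms_per_day;

    if (tod < 0) tod += ms_per_day;
    assert(tod >= 0 && tod < ms_per_day);

    t.hour   = (int)(tod / 3600000);
    t.minute = (int)(tod / 60000 % 60);
    t.second = (int)(tod / 1000 % 60);
    t.millis = (int)(tod % 1000);
    return t;
}

/* Writes "HH:MM:SS.sss" into out and returns snprintf's result. */
static int format_time_of_day(int64_t tc_ms, char *out, size_t size)
{
    tc_time_of_day t = split_time_of_day(tc_ms);
    return snprintf(out, size, "%02d:%02d:%02d.%03d",
                    t.hour, t.minute, t.second, t.millis);
}
```

The floored modulo matters for rows such as the 1854 dates, where the stored count is negative and a truncating remainder alone would yield a negative time of day.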

While reviewing the code, I also noticed that the three DATE_TIME patterns did not match the Date-Time unit formats in the source spec here: https://libguides.library.kent.edu/SPSS/DatesTime

I've also changed those patterns to match the spec.

@FlipperPA changed the title from "WIP: Fix patterns for DATE_TIMEs." to "WIP: Support DATE_TIMEs in STATA" on Mar 18, 2025
@FlipperPA changed the title from "WIP: Support DATE_TIMEs in STATA" to "Add Support for DATE_TIMEs in STATA" on Mar 18, 2025

@evanmiller (Contributor) left a comment

Thanks for the contribution. I left some suggestions.

fprintf(stderr, "%s:%d not a valid date-time: %s (expected format: yyyy-mm-dd hh:MM:SS with optional milliseconds. Datetime string is truncated at 23 characters to ignore microseconds and timezone information.)\n", __FILE__, __LINE__, date_time);
exit(EXIT_FAILURE);
}
int missing_ranges_count = readstat_variable_get_missing_ranges_count(var);

@evanmiller (Contributor):

Why was this logic removed?

@FlipperPA (Contributor, Author):

I had initially copied value_double_dta to value_double_date_time_dta to confirm that the code path was being reached, before implementing value_double_date_time_dta as its own function. I wasn't sure how the missing-ranges code would handle blank date fields in the CSV.

@evanmiller (Contributor):

OK, I'm guessing missing ranges are seldom used with dates, so let's leave it out

@FlipperPA (Contributor, Author):

@evanmiller Thanks for the feedback and suggestions. I'll be the first to admit my C skills are very, very rusty. If there's value in re-inserting the missing_ranges_count, I'm happy to add it back.

@evanmiller (Contributor):

The macOS build is unhappy for some reason; not sure if related to your change or not

@FlipperPA (Contributor, Author):

> The macOS build is unhappy for some reason; not sure if related to your change or not

@evanmiller Any tips on how I can best figure these out? I've been compiling on Rocky 9 but don't have an easy way to test on macOS. Thanks again for your guidance and expertise.

@evanmiller (Contributor):

Looks like a pre-existing error, so I'm not going to sweat it. Thanks for the contribution.

@evanmiller merged commit 14f3937 into WizardMac:dev on Mar 24, 2025
9 of 12 checks passed