Conversation
….warn for warning message
…ting related tests
…like "XD" and ":D"
…; Add tests for 'num_named_entity'
xehu
left a comment
Looks really good! Just a few small comments/questions.
| """ | ||
| emoji_pattern = r'[:;]-?\)+' | ||
| emojis = re.findall(emoji_pattern, text) | ||
| # emoji_pattern = r'[:;]-?\)+' |
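As a quick illustration of what the pattern under review matches, here is a minimal sketch (the sample text is made up):

```python
import re

# The emoticon pattern under review: ':' or ';', an optional '-',
# then one or more ')' characters.
emoji_pattern = r'[:;]-?\)+'

text = "great :) thanks ;-) see you :))"  # hypothetical sample input
emojis = re.findall(emoji_pattern, text)
print(emojis)  # → [':)', ';-)', ':))']
```

Note that this pattern does not cover emoticons like "XD" and ":D", which the commits above handle separately.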
Note: remove the commented-out pattern before merging this in.
Note: it looks like this branch crashes locally. This can be reproduced by installing the … Here's what the error looks like:

It seems like the source of the crash is an extra conversation ID. The original input dataframe only has two conversation IDs -- I and J -- but for some weird reason, we're getting three: I, J, and …
This issue arose because I tried to optimize RAM usage: instead of concatenating data frames inside the for loop, I wrote each intermediate df to the output file in append mode. As a result, if the output file already exists and needs to be overwritten, it isn't deleted -- it just grows longer. The commit above resolves this: we now append the intermediate dfs to a list and concatenate only once at the end, which still saves RAM while making minimal changes.
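The fix described above can be sketched as follows (a minimal, hedged example; the batch data and column name are made up, and the loop stands in for the real `get_sentiment` calls):

```python
import pandas as pd

# Hypothetical per-batch results standing in for get_sentiment(batch, ...).
batches = [pd.DataFrame({"score": [0.1]}), pd.DataFrame({"score": [0.9]})]

# Collect intermediate dfs in a list instead of appending to the CSV
# inside the loop; appending to an existing output file makes it grow
# rather than being overwritten.
parts = []
for batch_df in batches:
    parts.append(batch_df)

# Concatenate once at the end, then write the file in a single shot.
result = pd.concat(parts, ignore_index=True)
result.to_csv("output.csv", index=False)  # one overwrite, no stale rows
```

This keeps memory usage low (only small per-batch frames are held until the final concat) while guaranteeing the output file reflects exactly one run.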
```diff
- batch_df = get_sentiment(batch)
- batch_sentiments_df = pd.concat([batch_sentiments_df, batch_df], ignore_index=True)
+ batch_df = get_sentiment(batch, model_bert, device)
+ batch_df.to_csv(output_path, mode='a', header=first, index=False)
```
Note to self - we're now appending here; Emily to run vector tests locally and confirm that everything passes
Added tests for all columns in the feature dict. Closes #220, closes #359.