Suggestions: Include a Chao-Jaccard abundance index function for repOverlap #85

https://github.com/immunomind/immunarch/blob/9dce2ea78f4fb15557e17664910f14c47caa8e63/R/overlap.R#L222

The Jaccard index used here (though it might technically be a coefficient) compares sequence occurrence only: similarity is measured by which unique sequences are shared, regardless of how often each one appears in the set.
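To make the presence/absence behaviour concrete, here is a minimal sketch in base R. The helper name `jaccard_presence` and the toy sequences are made up for illustration and are not part of immunarch.

```r
# Hypothetical helper: Jaccard similarity on unique sequences only.
jaccard_presence <- function(x, y) {
  x <- unique(x)
  y <- unique(y)
  shared <- length(intersect(x, y))
  shared / (length(x) + length(y) - shared)
}

# Toy repertoires: "CASSL" occurs three times in a, but only once in b.
a <- c("CASSL", "CASSL", "CASSL", "CASSF")
b <- c("CASSL", "CASSY")

jaccard_presence(a, b)
```

The result is 1/3: one shared sequence out of three distinct ones, and the three copies of "CASSL" in `a` have no effect on it.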

However, the **[Chao Jaccard Index](https://pubmed.ncbi.nlm.nih.gov/23094376/)** is an abundance-based index that takes into account the number of times a particular combination occurs.
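For reference, the basic (unadjusted) form of the abundance-based Jaccard index is usually written as below. The notation is assumed here for illustration: $X_i$ and $Y_i$ are the counts of sequence $i$ in the two repertoires, $n$ and $m$ are the total counts in each, and $S$ is the set of sequences present in both. The published estimator also corrects for shared sequences missed by sampling, which is not shown.

$$
U = \sum_{i \in S} \frac{X_i}{n}, \qquad V = \sum_{i \in S} \frac{Y_i}{m}, \qquad J_{\text{abd}} = \frac{UV}{U + V - UV}
$$

If every sequence has a count of 1 in both repertoires, $U$ and $V$ reduce to the shared fractions of unique sequences and $J_{\text{abd}}$ equals the ordinary Jaccard index.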

We frequently use both of these metrics to calculate repertoire overlap.

I think the Chao Jaccard Index can be included in the overlap.R file by adding the following R code:

## Change line 98 of overlap.R to:
```r
.method = c("public", "overlap", "jaccard", "chao_jaccard_abundance_index", "tversky", "cosine", "morisita", "inc+public", "inc+morisita"),
```

## Add the following functions at line 222 of overlap.R:
```r
chao_jaccard_abundance_index <- function(.x, .y) {
  UseMethod("chao_jaccard_abundance_index")
}

chao_jaccard_abundance_index.default <- function(.x, .y) {
  .x <- collect(.x, n = Inf)
  .y <- collect(.y, n = Inf)
  # Number of rows shared by both repertoires.
  intersection <- nrow(dplyr::intersect(.x, .y))
  # Shared fraction of each repertoire. Each row is counted once,
  # so these fractions are not weighted by how often a sequence occurs.
  shared_fraction_x <- intersection / nrow(.x)
  shared_fraction_y <- intersection / nrow(.y)
  # Chao-Jaccard form: UV / (U + V - UV).
  (shared_fraction_x * shared_fraction_y) /
    (shared_fraction_x + shared_fraction_y - shared_fraction_x * shared_fraction_y)
}

chao_jaccard_abundance_index.character <- function(.x, .y) {
  # Character vectors have no rows, so use length() rather than nrow().
  intersection <- length(dplyr::intersect(.x, .y))
  shared_fraction_x <- intersection / length(.x)
  shared_fraction_y <- intersection / length(.y)
  (shared_fraction_x * shared_fraction_y) /
    (shared_fraction_x + shared_fraction_y - shared_fraction_x * shared_fraction_y)
}
```

## Clarifying Points
I'm happy to open a pull request for this, but I'm not sure whether the chao_jaccard_abundance_index function above is correct. The formula is right, but it requires the _sum_ of the counts of all sequences in X and Y, **not** the number of _clones_. I'm not sure exactly what input the function receives, but I think it operates at the clonal level, which would not be right.
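To illustrate the difference, here is a rough sketch of an abundance-weighted version in base R. It is not immunarch code: the function name is hypothetical, and it assumes each repertoire is a data frame with one sequence column and one count column (the defaults `CDR3.aa` and `Clones` are assumptions about the input). It implements only the basic form of the index, without the correction for unseen shared sequences.

```r
# Hypothetical helper, not part of immunarch: abundance-weighted Chao-Jaccard
# (basic form only). Column names are assumptions about the input format.
chao_jaccard_abundance_sketch <- function(x, y,
                                          seq_col = "CDR3.aa",
                                          count_col = "Clones") {
  # Total count per sequence, in case a sequence appears on several rows.
  x_counts <- tapply(x[[count_col]], x[[seq_col]], sum)
  y_counts <- tapply(y[[count_col]], y[[seq_col]], sum)
  shared <- intersect(names(x_counts), names(y_counts))
  # U and V: share of each repertoire's total count held by shared sequences.
  u <- sum(x_counts[shared]) / sum(x_counts)
  v <- sum(y_counts[shared]) / sum(y_counts)
  if (u == 0 || v == 0) {
    return(0) # no shared sequences; avoids 0 / 0
  }
  (u * v) / (u + v - u * v)
}

# Toy data: "CASSL" dominates x but is only half of y.
x <- data.frame(CDR3.aa = c("CASSL", "CASSF"), Clones = c(9, 1))
y <- data.frame(CDR3.aa = c("CASSL", "CASSY"), Clones = c(1, 1))

chao_jaccard_abundance_sketch(x, y)
```

With these toy counts U = 0.9 and V = 0.5, so the result is 9/19 (about 0.474). Counting each sequence once instead gives U = V = 0.5 and a value of 1/3, which is the clonal-level behaviour described above.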
