Skip to content

Conversation

@khoale88
Copy link
Contributor

@khoale88 khoale88 commented Nov 27, 2025

  • handle multigraph to match with neo4j -> treat each edge as a row
  • return [] for no edge, {} for 1 edge and [{}, {}] for multi edge
  • refactor node/edge mapping to easier to follow
  • handle zero edge hop better
  • fix various bug

@codspeed-hq
Copy link

codspeed-hq bot commented Nov 27, 2025

CodSpeed Performance Report

Merging #96 will degrade performances by 25.02%

Comparing khoale88:issue-93 (f26c01f) with master (e9dcd88)

Summary

⚡ 1 improvement
❌ 3 regressions
✅ 93 untouched
🆕 2 new

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
test_alias_with_order_by 48.2 ms 21.7 ms ×2.2
test_simple_multi_edge 108 ms 144.1 ms -25.02%
test_skip_and_limit 138.1 ms 175.4 ms -21.26%
test_skip_only 108.8 ms 144.8 ms -24.86%
🆕 test_node_type_edge_zero_hop_complex[DiGraph] N/A 58.5 ms N/A
🆕 test_node_type_edge_zero_hop_complex[MultiDiGraph] N/A 58.6 ms N/A

- no edge: [].
- 1 edge: {key: value}
- 2 edges: [{k:v}, {k: v}]. Cannot access attributes in return and
  where.
@khoale88
Copy link
Contributor Author

khoale88 commented Nov 29, 2025

Hi @j6k4m8 , I'm doing this big refactor with a lot of changes. I need your help in reviewing and analyzing the impact and correctness of this.

The PR comes with 4 commites so I hope it's eaiser to follow.

return and check edge following neo4j

  • unwinde multi edge
  • no edge: [].
  • 1 edge: {key: value}
  • 2 edges: [{k:v}, {k: v}]. Cannot access attributes in return and
    where.

In this commit, I separate the get_edge and get_node into 2 different utilities.

Also I unwind multiedge graph such that each edge is a row. This mean no more labels and attribute aggregation and confusion. I'll explain the unwind multi edge later in the 3rd commit. But it's the same as edge hop, I will take every single edge key between 2 nodes, and we have catesian product where each edge with unique edge key is a row.

Along the way, I treat relation as a list of edges, not just None or the one aggregated edge as we are having. This means some thing break if we access edge properties in where statement and return statement. As stated in the commit message

  • no edge is equivalent to []
  • one edge is equivalent to the edge it self {}. So the property access is working here
  • two or more edges is equivalent to [{}, {}]. Property access is confusing and is prohibited. I believe we need to implement relationship, ANY, ALL, and some other function to achieve the filtering on relationship

refactor edgehop to util

Our code is supporting variadic edge hop by generate different motif for every combination of edge hop value. The edge hop catesian production and generation is refactored to use models and classes.

I introduce the struct.HopSpec to store the original edge, the nodes after hop, the and the hop_count for this. It's important to store the hop_count for clarity and later usage

I delegate the realization of the motif to materialize_motif while I store the catesian product of edge hop edges in struct.HopAssignment (just a dictionary), and under the help of generate_edge_hop_specs and generate_hop_assignments

docs and test

This is important one as wel as I test and document some refactored function earlier like get_edge_from_host, find_multiedge_keys, generate_multiedge_edge_hop_key, etc.

To improve maintainability, We now have so many data structures like key, hop, edge mapping, node mapping (match) passing around. I deliberately introduce EdgePath, kindda similar to path = (a, b, c, d, e). With EdgePath, we can access the path itself, or the edges for the path, similar to edges = ((a, b), (b, c), (c, d), (d, e). And of course, we have the edge stored inside.

EdgeMapping is a wrapper for edge hops and edge keys together. You can access EdgePath of a relation from EdgeMapping without knowing much how to construct an EdgePath

Match model now is a combination of NodeMapping, EdgeMapping, and where_result (idk what is this for). I introduce a small model call MotifToHostView or mth. This should give us a lot of convenience not to map to a host node everytime we access a node in an edge

access nodes with mth
instead of assert match[0]["A"] == "a", we do assert match.mth.node("A") == "a'

access edges with mth
instead of

edge_path = edge_mapping[("A", "C")]
edge_path_edges = list(zip(edge_path[:-1], edge_path[1:]))
assert [(match[0][a], match[0][b]) for a, b in edge_path_edges] == [("a", "b"), ("b, "c")]

we do

edge_path = match.mth.edge("A", "C")
assert edge_path .edges == [("a", "b"), ("b, "c")]  # this is simplified example, the real model include edge key

touch up

some cosmetic docs and model-keys renamed.

Let me know if you have any concern.

P.S

I'm not sure why performance is not at par, do you have any insight? the catesian of multiedge keys should not be a main factor. This is not tested with nx.Graph

@j6k4m8
Copy link
Member

j6k4m8 commented Nov 29, 2025

WOOHOOO I'm reading through this now! 🎉

I'm not sure why performance is not at par, do you have any insight? the catesian of multiedge keys should not be a main factor. This is not tested with nx.Graph

I wouldn't worry about this, this performance benchmarks seem a little wonky anyway. I will override this if we need to, I don't think that this is a real performance regression but I'll double-check before we merge.

Copy link
Member

@j6k4m8 j6k4m8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a first pass and things look great — are you able to see the codspeed reports? Basically this looks like a bug on THEIR end, not something in the code:

Image

CPython internals are none of my business :)

Comment on lines 25 to 26
u: str
v: str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
u: str
v: str
u: NodeID
v: NodeID

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I make change in my local, together with future comments from you

)


class MotifToHostView:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is such a smart data structure to have around

@j6k4m8
Copy link
Member

j6k4m8 commented Nov 29, 2025

I'm still reading a bit to make sure I understand what's going on, but this is an amazing update!
Will double check with some existing workflows to make sure it doesn't break anything in end-to-end tests on my side

@khoale88
Copy link
Contributor Author

khoale88 commented Nov 30, 2025

@j6k4m8 I have added the 5th commit. Sorry for this inconvenience. I merged the #92 into this actually

Collapse Nodes of Zero Edge Hop using Disjoint Set Union

We have some issues with handling zero edge hop.

chain zero hop edge

match (A) -[*0]->(B)-[*0]->(C)

This should collapse not only B to A, but also C to A

edge referencing to collapsed node

match  (A) -[*0]->(B) -->(C)

The edge (B, C) should become (A, C)

using DSU as a solution

DSU is known for solving this efficiently
after passing zero hop to DSU, we have alias that can be used to identify where is B and/or C is referenced to

Additional improvement

docs and tests

more docs and tests

Double check multi graph edges

As grandiso cannot efficiently filter multiple edges betwen 2 nodes, we have to do this in our code (line 1294-1309). We only do the double check if there are more than 1 edge between 2 nodes

if edge.u != edge.v and self._target_graph.number_of_edges(host_u, host_v) < 2:
    continue

direct equijoin fix

Grandiso doesn't work well with equijoin.

match (n) --> (n)

We fix this by double check the edge if it's a direct equijoin or skip the double check if it's not

if edge.u != edge.v and self._target_graph.number_of_edges(host_u, host_v) < 2:
    continue

@j6k4m8
Copy link
Member

j6k4m8 commented Dec 6, 2025

I finally got a chance to take a look at the latest — this is fabulous, I'm really excited to get this merged in :)

Okay with you for me to merge this? I am overriding the codspeed performance regression as it looks like it's a CPython internal.

@davidmezzetti I imagine this may impact some existing userbases if they have saved queries; my intention is to roll this as a v2.0.0 so it's not auto-updated in the existing consuming libraries and tools. Is that compatible with your use?

@khoale88
Copy link
Contributor Author

khoale88 commented Dec 6, 2025

I'm okay to merge. Regarding the release management, are we going to maintain the v1.x branches (fix bug, etc) for a while?

If possible, can we hold on to the release of v2.0? I'm going to have other PR soon.

@j6k4m8
Copy link
Member

j6k4m8 commented Dec 6, 2025

definitely, I'll merge this in but hold off on the 2* release!

@j6k4m8 j6k4m8 merged commit bf30e3e into aplbrain:master Dec 6, 2025
6 of 7 checks passed
@davidmezzetti
Copy link
Contributor

I finally got a chance to take a look at the latest — this is fabulous, I'm really excited to get this merged in :)

Okay with you for me to merge this? I am overriding the codspeed performance regression as it looks like it's a CPython internal.

@davidmezzetti I imagine this may impact some existing userbases if they have saved queries; my intention is to roll this as a v2.0.0 so it's not auto-updated in the existing consuming libraries and tools. Is that compatible with your use?

@j6k4m8 Exciting to see the library evolving and progressing! As long as the breaking changes are documented, I'll fix it in TxtAI where possible. When not possible, I'll at least be able to point to the docs on how to update queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants