Skip to content

Scalar aggregation classes#97

Open
khoale88 wants to merge 6 commits intoaplbrain:masterfrom
khoale88:scalar_aggregation_classes
Open

Scalar aggregation classes#97
khoale88 wants to merge 6 commits intoaplbrain:masterfrom
khoale88:scalar_aggregation_classes

Conversation

@khoale88
Copy link
Contributor

@khoale88 khoale88 commented Dec 6, 2025

No description provided.

@codspeed-hq
Copy link

codspeed-hq bot commented Dec 6, 2025

CodSpeed Performance Report

Merging #97 will degrade performances by 10.85%

Comparing khoale88:scalar_aggregation_classes (4646928) with master (bf30e3e)

Summary

❌ 2 regressions
✅ 97 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
test_nested_nots_in_statements[MultiDiGraph] 33.3 ms 37.4 ms -10.85%
test_nested_nots_in_statements[DiGraph] 33.3 ms 37.3 ms -10.84%

@khoale88
Copy link
Contributor Author

khoale88 commented Dec 6, 2025

I know there might be some performance hit and I 'll take a look at that. I will provide details for this PR later, but basically, Refactoring and better support for scalar functions like ID, toLower, toUpper, Trim, Type, Coalesce, Size (return and left side condition). Support for better aggregation as well, including Collect, Relationships, list comprehensive syntax. And lastly Condition on list such as ANY, ALL, None, Single

@khoale88
Copy link
Contributor Author

khoale88 commented Dec 6, 2025

Commit 1: Support list predicates all, any, none, single, size, relationships

support list_expression and relationships function

The list expression can be the entity_id which is a list by itself or through relationships function

list_expression      : "relationships"i "(" entity_id ")"  -> relationships_function
                             | entity_id  -> entity_list

The relationships function is to retrieve all relationships in a mattern matching, but I don't have that ATM. Right now we simply have this in our test and support

        MATCH (a)-[r*1..3]->(c)
        WHERE size(relationships(r)) = 2
        RETURN r

Basically, it just return the value (already a list). But this should be the base sample for future support. This is an extension of ListExpression

class RelationshipsFunction(ListExpression)

support all/none/single/any

This works on list_expression. One of the sample is

        MATCH (a)-[r*2]->(c)
        WHERE all(edge IN r WHERE edge.weight > 5)
        RETURN r

These classes are just conditions. And they accept nested compound condition.

        MATCH (a)-[r*2]->(c)
        WHERE all(edge IN r WHERE edge.weight > 5 AND edge.type = "friend")
        RETURN r

SIZE function

we support size to compute size of a list. In this commit, it is an instance of Comdition, but it should be a Scalar Function.

Introduction to Scope.

the signatures of these function are quite the same as Condition, except that they accept scope (dict) parameter. Consider the following query, the scope variable is the "edge", element of r, which is not a, neither r, neither c. so match doesn't work.

        MATCH (a)-[r*2]->(c)
        WHERE all(edge IN r WHERE edge.weight > 5 AND edge.type = "friend")
        RETURN r

@khoale88
Copy link
Contributor Author

khoale88 commented Dec 6, 2025

Commit 2: refactor ID(A) to Scalar Function

ID and ScalarFunction

We have an intermediate class call ScalarFunction, it is an extension of Condition (idk why I inherit it as a Condition, but it doesn't really matter for now)

Under Scalar we have ID, and SIZE. Basically, the ID is getting a node id from match. So we nolonger have to manually handle ID(A) in the where, and the return.

For example in the lookup, we just call the scalar function data_path(...) to get the value. This should work with all scalar functions like ID and SIZE in both WHERE and RETURN statements

for data_path in data_paths:
            if isinstance(data_path, ScalarFunction):
                # Evaluate scalar function for each match
                ret = []
                for match in true_matches:
                    result_value = data_path(
                        match,
                        self._target_graph,
                        self._return_edges,
                        scope=None
                    )
                    ret.append(result_value)

                # Use str(data_path) as key: "ID(A)", "size(r)", etc.
                result[str(data_path)] = ret[offset_limit]
                processed_paths.add(data_path)
                processed_paths.add(str(data_path))
                continue

@khoale88
Copy link
Contributor Author

khoale88 commented Dec 6, 2025

Commit 3+4: refactor aggregation to classes

Class AggregationFunction and aggregation.

  • Introduce classes AggregationFunction (abstract) and concrete Count, Sum, Avg, Max, Min, Collect
  • The names are self-explainatory, some we already support in aggregation, but this is a little bit different. we have separated classes for aggregation instead of just a function handling if-else cases.
  • This should pave the way for handling aggregation in WHERE clause in the future with WITH statement.
  • Along the way, we discover something we do it wrong, and I haven't fix it. The following query should not work because we return AVR(r.value) and r.value at the same time, which is invalid query.
        MATCH (n)-[r]->()
        RETURN n.name, AVG(r.value), r.value
        ORDER BY AVG(r.value) DESC

ScalarFunction and others

  • Introduce ScalarFunction, ToLower, ToUpper, Trim, Type, and Coalesce
  • This should work in both RETURN and WHERE statements
  • Just realize I forgot to test scalar in WHERE, but this should work (manually checked)
    def test_string_functions_with_where(self):
        """Test string functions in WHERE clause"""
        host = nx.DiGraph()
        host.add_node("a", name="ALICE")
        host.add_node("b", name="BOB")
    
        qry = """
        MATCH (n)
        WHERE toLower(n.name) = "alice"
        RETURN n.name
        """
        res = GrandCypher(host).run(qry)
        assert set(res["n.name"]) == {"ALICE"}

introduction to EntityAttributeGetter

  • This is to answer the question: We handle Return ID(A), size(r), how about Return A, B.name? I just move the logic of getting variables EntityAttributeGetter, including getting match node/match edge/scope variable
def evaluate(self, match: Match, host: nx.DiGraph,
                 return_edges: dict = None, scope: dict = None):
        """
        Evaluate this entity reference against a match.

        Priority order for resolution:
        1. Scope variables (highest priority - for list predicates)
        2. Node mappings (standard case)
        3. Edge mappings (for edge references)
        4. None (not found)

        Args:
            match: The current match containing node mappings
            host: The graph to query
            return_edges: Optional edge mappings for edge references
            scope: Optional scope dictionary for list predicate variables

        Returns:
            The attribute value if found, None otherwise
        """
        # 1. Check scope first (highest priority for list predicates)
        if scope and self.entity in scope:
            element = scope[self.entity]
            if self.attribute:
                # Scope variable with attribute access: e.related
                return element.get(self.attribute) if isinstance(element, dict) else None
            # Simple scope variable: e
            return element

        # 2. Check node mappings (standard case)
        if self.entity in match.node_mappings:
            node_id = match.node_mappings[self.entity]
            if self.attribute:
                # Node with attribute: n.name
                return host.nodes[node_id].get(self.attribute)
            # Simple node reference: n - return full node dictionary
            return dict(host.nodes[node_id])

        # 3. Check edge mappings (for edge references)
        if return_edges and self.entity in return_edges:
            edge_mapping = return_edges[self.entity]
            host_edges = match.mth.edge(*edge_mapping).edges
            return get_edge_from_host(host, host_edges, self.attribute)

        return None

Commit 5: docs

let's check docs/architecture docs

@khoale88
Copy link
Contributor Author

khoale88 commented Dec 6, 2025

More work to be done

  • more tests regarding scalar and aggregation, simple and complex, in WHERE clause.
  • more throughout test with auto-hint to see these changes does trigger auto hints properly or not?

@khoale88 khoale88 force-pushed the scalar_aggregation_classes branch from 2fdc232 to acc89ca Compare December 13, 2025 17:52
@khoale88 khoale88 force-pushed the scalar_aggregation_classes branch from acc89ca to 4646928 Compare December 13, 2025 18:01
@j6k4m8
Copy link
Member

j6k4m8 commented Dec 19, 2025

@khoale88 sorry I haven't been commenting — I HAVE been following this closely, this is really exciting work, and I think gets us away from a lot of the hacks this repo was using earlier on and puts us on much firmer ground. THANK YOU!

I'm wondering how you're thinking about approaching functions like size where they can act on a list or on, say, a string:

size('hello') // should return 5

[docs]

It looks like the function should support it, though I think the grammar expects a list right now?

I'm super excited about this PR!!!

@khoale88
Copy link
Contributor Author

you are right @j6k4m8 , syntax support should be easy, let me add that ... after the break :D.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants