Kloppy Integration by UnravelSports · Pull Request #161 · floodlight-sports/floodlight

UnravelSports · 2025-06-05T08:10:21Z

I've been working on #160. Please find the draft in this PR.

It contains 2 files:

floodlight.io.from_kloppy
from_kloppy.ipynb to simply run everything and check it out.

I also have some questions:

What is the purpose of "precedence" in the Teamsheet creation? Does it matter that we simply use the kloppy PositionType enum to order here? This does do GK, DF, MF, FW, but the order within each line is for example LB, RB, LCB, RCB.
Does floodlight enforce or rely on some specific orientation? I.e. home left, away right for the whole game?
Does floodlight enforce or rely on a specific coordinate system with pitch dimensions? I'm currently transforming to "secondspectrum", because it seems your Pitch object needs a string name and "secondspectrum" is supported.
I assume we rely on Period start and end frame ids? Are they really necessary, or only for creating XY objects?
Does floodlight record different metadata for event and tracking data? (e.g. frame rate)
Does floodlight not support z coordinates?
I noticed In some xy_objects we have "HT1" "HT2" (e.g. second spectrum) and in others we get "firstHalf" and "secondHalf" (e.g. in IDSSEDataset). Does this matter?
For events:
- What is the purpose of eID and would it matter if we simply use kloppy EventType for this?
- Is gameclock seconds since start of period?
- Is outcome 1.0 and 0.0?
- How much would it matter if timestamp is not a datetime, but a timedelta instead?
- Is qualifier simply the raw event we get?
Tests are still missing

manuba95

Hey @UnravelSports!
Thanks for the suggestion and the PR, I think this would add a lot of value to both Kloppy and floodlight! I added some specific comments below (mostly cosmetic) but also have some general ideas we should consider before a merge.

My main concern is that this is currently very taylored towards a two-team (home/away) scenario. Although this may be the use case for many official data providers, future GPS/LPS data support of Kloppy may result in different team (and period) constellations. I think it's worth thinking about how to generalize this PR now, so any number of groups and periods can be supported. I also left some comments about this below, but please let me know what you think about this as this would probably be the largest refactoring of your PR!

To answer your questions:

I think "precedence" was used specifically in the secondspectrum parser to have a rule on how the xIDs are sorted. This seems to be a legacy from my point of view since the introduction of Teamsheets. But I have to say, I was not involved in this parser so maybe I am missing something. I generally don't see any value of adding this logic to the kloppy bridge.
We don't rely on a specific orientation but that the x-axis is the longer side of the pitch
No, the origin of the Pitch coordinate system gets defined by the x/ylim argument. The scale can be set by the length/width and unit arguments. Also see my comment below regarding the get_pitch function.
I am not sure I understand the question. For the first creation of the XY array we need the number of players an number of frames. The latter could be extracted from the first and last frame ID or (as you do) from the length of the df. Please elaborate if this doesn't answer your question.
Yes, different metadata is stored as attributes of the Core objects (e.g., framerate is an attribute of the XY object). But I don't think they would differ between data from the same provider (e.g. framerate in XY and possession Code from the same provider should be the same)
We do not support z coordinates
Yes, I commented on that below as well. We are currently not very consistent with the period naming. I think for now it does not matter to much but streamlining this terminology could be a future project, For now, you could go with the terminology used in Kloppy ("Period_1" or similar). Honestly, I currently don't have a really strong opinion on this.
Events: I will comment on this later!
Tests: We currently don't include unit tests for our parsers mainly for data availablity reasons (which is not optimal!). Some ideas on how to test the kloppy bridge would be appreciated!

General minor comments:

Please merge into develop and add conventional PR name (e.g., "feat: Kloppy integration")
Please check codestyle with Black and flake8

Thanks again for your work, I hope we can include this soon!

manuba95 · 2025-07-03T14:31:43Z

floodlight/io/from_kloppy.py

@@ -0,0 +1,577 @@
+from floodlight import Teamsheet, Code, XY, Pitch


Please reorder imports: standard, third party, local. Then alphabetically.

Please also reorder the functions to main/public first, private below

manuba95 · 2025-07-03T14:35:45Z

floodlight/io/from_kloppy.py

+    pitch: Pitch
+        Pitch object with actual pitch length and width.
+    """
+    from kloppy.domain import AttackingDirection


Is there a good reason I'm missing to have the imports within the function calls? I don't think calling anything from this module makes sense without having kloppy installed, right? Would it make sense to put all kloppy imports to the top of the file while also checking if its installed?

manuba95 · 2025-07-03T14:39:05Z

floodlight/io/from_kloppy.py

+    periods = {}
+    directions = {}
+
+    metadata_dict["framerate"] = metadata.frame_rate


I think medatada_dict.update({"length": ... , "width": ...}) would be a bit cleaner here, but thats just my opinion.

manuba95 · 2025-07-03T14:44:41Z

floodlight/io/from_kloppy.py

+
+    metadata_dict["framerate"] = metadata.frame_rate
+    metadata_dict["length"] = (
+        metadata.pitch_dimensions.x_dim if metadata.pitch_dimensions else None


Would it make sense to put pitch_dimensions.pitch_length/width here? Then the metadata_dictionary would have not include kloppy objects. Right now, the metadata_dict["length/width"] does not get called again while the pitch object creation calls the .pitch_length/width later.

manuba95 · 2025-07-03T14:47:52Z

floodlight/io/from_kloppy.py

+    metadata_dict["away_tID"] = away_team.team_id
+
+    for period in metadata.periods:
+        segment = f"HT{period.id}"


We are currently not very consistent in naming our periods but keep it more or less to the naming convention of the provider. E.g., with secondspectrum we use "HT1/2", with the DFL data, we use "firstHalf"/"secondHalf". This may not be optimal from our side, and I think a good solution would be to use Kloppys naming convention when transforming from Kloppy to floodlight, so e.g., "Period_1"/"P1" or something similar. Any thoughts?

I have two thoughts on this. It should be consistent, and it should probably be called "Period". I'll change it to "Period_{id}"

manuba95 · 2025-07-03T15:49:43Z

floodlight/io/from_kloppy.py

+    return ballstatus_code
+
+
+def _create_ballstatus_code(df: pd.DataFrame) -> np.ndarray:


This seems to be duplicate.

manuba95 · 2025-07-03T15:51:00Z

floodlight/io/from_kloppy.py

+
+
+def _extract_team_xy_data(
+    df: pd.DataFrame, team: "Team", teamsheet: "Teamsheet"


Typing of team and teamsheet is incorrect (string). Also, although private, df could be a bit more explicit.

manuba95 · 2025-07-03T15:52:16Z

floodlight/io/from_kloppy.py

+        Array with shape (n_frames, 2*n_players) containing x,y coordinates
+    """
+    # Ensure teamsheet has xIDs
+    if "xID" not in teamsheet.teamsheet.columns:


Is this necessary? It already gets checked in the main function.

manuba95 · 2025-07-03T15:53:10Z

floodlight/io/from_kloppy.py

+        teamsheet.add_xIDs()
+
+    # Get mapping from jersey number to array index (xID)
+    links_jID_to_xID = teamsheet.get_links("jID", "xID")


This expects the jID to exist and be unique. can we expect that? You could use the "player" column, since it has to exist in the Teamsheets and its likely to be unique (name or constructed)

manuba95 · 2025-07-03T15:54:30Z

floodlight/io/from_kloppy.py

+    links_jID_to_xID = teamsheet.get_links("jID", "xID")
+
+    # Get the maximum xID to determine array size
+    max_xID = max(links_jID_to_xID.values()) if links_jID_to_xID else 0


would n_players = len(links_jID_to_xID) be sufficient?

draft PR, from_kloppy

04a8621

manuba95 requested changes Jul 3, 2025

View reviewed changes

		@@ -0,0 +1,577 @@
		from floodlight import Teamsheet, Code, XY, Pitch

		return ballstatus_code


		def _create_ballstatus_code(df: pd.DataFrame) -> np.ndarray:



		def _extract_team_xy_data(
		df: pd.DataFrame, team: "Team", teamsheet: "Teamsheet"

Conversation

UnravelSports commented Jun 5, 2025

Uh oh!

manuba95 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants