Conversation
manuba95
left a comment
There was a problem hiding this comment.
Hey @UnravelSports!
Thanks for the suggestion and the PR, I think this would add a lot of value to both Kloppy and floodlight! I added some specific comments below (mostly cosmetic) but also have some general ideas we should consider before a merge.
My main concern is that this is currently very taylored towards a two-team (home/away) scenario. Although this may be the use case for many official data providers, future GPS/LPS data support of Kloppy may result in different team (and period) constellations. I think it's worth thinking about how to generalize this PR now, so any number of groups and periods can be supported. I also left some comments about this below, but please let me know what you think about this as this would probably be the largest refactoring of your PR!
To answer your questions:
- I think "precedence" was used specifically in the secondspectrum parser to have a rule on how the xIDs are sorted. This seems to be a legacy from my point of view since the introduction of Teamsheets. But I have to say, I was not involved in this parser so maybe I am missing something. I generally don't see any value of adding this logic to the kloppy bridge.
- We don't rely on a specific orientation but that the x-axis is the longer side of the pitch
- No, the origin of the Pitch coordinate system gets defined by the
x/ylimargument. The scale can be set by thelength/widthandunitarguments. Also see my comment below regarding the get_pitch function. - I am not sure I understand the question. For the first creation of the XY array we need the number of players an number of frames. The latter could be extracted from the first and last frame ID or (as you do) from the length of the df. Please elaborate if this doesn't answer your question.
- Yes, different metadata is stored as attributes of the Core objects (e.g., framerate is an attribute of the XY object). But I don't think they would differ between data from the same provider (e.g. framerate in XY and possession Code from the same provider should be the same)
- We do not support z coordinates
- Yes, I commented on that below as well. We are currently not very consistent with the period naming. I think for now it does not matter to much but streamlining this terminology could be a future project, For now, you could go with the terminology used in Kloppy ("Period_1" or similar). Honestly, I currently don't have a really strong opinion on this.
- Events: I will comment on this later!
- Tests: We currently don't include unit tests for our parsers mainly for data availablity reasons (which is not optimal!). Some ideas on how to test the kloppy bridge would be appreciated!
General minor comments:
- Please merge into develop and add conventional PR name (e.g., "feat: Kloppy integration")
- Please check codestyle with Black and flake8
Thanks again for your work, I hope we can include this soon!
| @@ -0,0 +1,577 @@ | |||
| from floodlight import Teamsheet, Code, XY, Pitch | |||
There was a problem hiding this comment.
Please reorder imports: standard, third party, local. Then alphabetically.
There was a problem hiding this comment.
Please also reorder the functions to main/public first, private below
| pitch: Pitch | ||
| Pitch object with actual pitch length and width. | ||
| """ | ||
| from kloppy.domain import AttackingDirection |
There was a problem hiding this comment.
Is there a good reason I'm missing to have the imports within the function calls? I don't think calling anything from this module makes sense without having kloppy installed, right? Would it make sense to put all kloppy imports to the top of the file while also checking if its installed?
| periods = {} | ||
| directions = {} | ||
|
|
||
| metadata_dict["framerate"] = metadata.frame_rate |
There was a problem hiding this comment.
I think medatada_dict.update({"length": ... , "width": ...}) would be a bit cleaner here, but thats just my opinion.
|
|
||
| metadata_dict["framerate"] = metadata.frame_rate | ||
| metadata_dict["length"] = ( | ||
| metadata.pitch_dimensions.x_dim if metadata.pitch_dimensions else None |
There was a problem hiding this comment.
Would it make sense to put pitch_dimensions.pitch_length/width here? Then the metadata_dictionary would have not include kloppy objects. Right now, the metadata_dict["length/width"] does not get called again while the pitch object creation calls the .pitch_length/width later.
| metadata_dict["away_tID"] = away_team.team_id | ||
|
|
||
| for period in metadata.periods: | ||
| segment = f"HT{period.id}" |
There was a problem hiding this comment.
We are currently not very consistent in naming our periods but keep it more or less to the naming convention of the provider. E.g., with secondspectrum we use "HT1/2", with the DFL data, we use "firstHalf"/"secondHalf". This may not be optimal from our side, and I think a good solution would be to use Kloppys naming convention when transforming from Kloppy to floodlight, so e.g., "Period_1"/"P1" or something similar. Any thoughts?
There was a problem hiding this comment.
I have two thoughts on this. It should be consistent, and it should probably be called "Period". I'll change it to "Period_{id}"
| return ballstatus_code | ||
|
|
||
|
|
||
| def _create_ballstatus_code(df: pd.DataFrame) -> np.ndarray: |
There was a problem hiding this comment.
This seems to be duplicate.
|
|
||
|
|
||
| def _extract_team_xy_data( | ||
| df: pd.DataFrame, team: "Team", teamsheet: "Teamsheet" |
There was a problem hiding this comment.
Typing of team and teamsheet is incorrect (string). Also, although private, df could be a bit more explicit.
| Array with shape (n_frames, 2*n_players) containing x,y coordinates | ||
| """ | ||
| # Ensure teamsheet has xIDs | ||
| if "xID" not in teamsheet.teamsheet.columns: |
There was a problem hiding this comment.
Is this necessary? It already gets checked in the main function.
| teamsheet.add_xIDs() | ||
|
|
||
| # Get mapping from jersey number to array index (xID) | ||
| links_jID_to_xID = teamsheet.get_links("jID", "xID") |
There was a problem hiding this comment.
This expects the jID to exist and be unique. can we expect that? You could use the "player" column, since it has to exist in the Teamsheets and its likely to be unique (name or constructed)
| links_jID_to_xID = teamsheet.get_links("jID", "xID") | ||
|
|
||
| # Get the maximum xID to determine array size | ||
| max_xID = max(links_jID_to_xID.values()) if links_jID_to_xID else 0 |
There was a problem hiding this comment.
would n_players = len(links_jID_to_xID) be sufficient?
I've been working on #160. Please find the draft in this PR.
It contains 2 files:
floodlight.io.from_kloppyfrom_kloppy.ipynbto simply run everything and check it out.I also have some questions:
What is the purpose of "precedence" in the Teamsheet creation? Does it matter that we simply use the kloppy PositionType enum to order here? This does do GK, DF, MF, FW, but the order within each line is for example LB, RB, LCB, RCB.
Does floodlight enforce or rely on some specific orientation? I.e. home left, away right for the whole game?
Does floodlight enforce or rely on a specific coordinate system with pitch dimensions? I'm currently transforming to "secondspectrum", because it seems your
Pitchobject needs a string name and "secondspectrum" is supported.I assume we rely on Period start and end frame ids? Are they really necessary, or only for creating XY objects?
Does floodlight record different metadata for event and tracking data? (e.g. frame rate)
Does floodlight not support z coordinates?
I noticed In some xy_objects we have "HT1" "HT2" (e.g. second spectrum) and in others we get "firstHalf" and "secondHalf" (e.g. in IDSSEDataset). Does this matter?
For events:
eIDand would it matter if we simply use kloppy EventType for this?gameclockseconds since start of period?outcome1.0 and 0.0?timestampis not a datetime, but atimedeltainstead?qualifiersimply the raw event we get?Tests are still missing