Project split #17
Changes from all commits
4421c45
54ea7e1
2939b2a
a3eb0d1
5160c25
0227f71
6694876
3fd4555
6644a88
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| /** Copyright 2020 ABSA Group Limited | ||
| * | ||
| * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
| * an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations under the License. | ||
| */ | ||
|
|
||
| package za.co.absa | ||
|
|
||
| import za.co.absa.hash.HashUtils.HASH_COLUMN_NAME | ||
| import hash.HashUtils | ||
| import org.apache.spark.sql.{DataFrame, SparkSession} | ||
| import org.slf4j.{Logger, LoggerFactory} | ||
|
|
||
| /** Comparator object that compares two parquet files and creates metrics | ||
| */ | ||
| object Comparator { | ||
| private val logger: Logger = LoggerFactory.getLogger(Comparator.getClass) | ||
|
|
||
| /** Compare two DataFrames and return diff rows | ||
| * | ||
| * @param dataA | ||
| * full contents of dataset A | ||
| * @param dataB | ||
| * full contents of dataset B | ||
| * @param excludedColumns | ||
| * columns to exclude from comparison | ||
| * @param spark | ||
| * SparkSession | ||
| * @return | ||
| * Tuple of (diffA, diffB) DataFrames containing rows unique to each dataset | ||
| */ | ||
| def compare( | ||
| dataA: DataFrame, | ||
| dataB: DataFrame, | ||
| excludedColumns: Seq[String] = Seq.empty | ||
| )(implicit spark: SparkSession): (DataFrame, DataFrame) = { | ||
| // Apply column exclusion | ||
| val filteredDataA = DatasetComparisonHelper.exclude(dataA, excludedColumns, "A") | ||
| val filteredDataB = DatasetComparisonHelper.exclude(dataB, excludedColumns, "B") | ||
|
|
||
| val dfWithHashA: DataFrame = HashUtils.createHashColumn(filteredDataA) | ||
| logger.info("Hash for A created") | ||
|
|
||
| val dfWithHashB: DataFrame = HashUtils.createHashColumn(filteredDataB) | ||
| logger.info("Hash for B created") | ||
|
|
||
| // select non-matching hashes | ||
| logger.info("Getting diff hashes, A except B") | ||
| val diffHashA: DataFrame = dfWithHashA.select(HASH_COLUMN_NAME).exceptAll(dfWithHashB.select(HASH_COLUMN_NAME)) | ||
| logger.info("Getting diff hashes, B except A") | ||
| val diffHashB: DataFrame = dfWithHashB.select(HASH_COLUMN_NAME).exceptAll(dfWithHashA.select(HASH_COLUMN_NAME)) | ||
|
|
||
| // join on hash column (get back whole rows) | ||
| logger.info("Getting diff rows for A") | ||
| val distinctDiffA: DataFrame = diffHashA.join(dfWithHashA, Seq(HASH_COLUMN_NAME)).distinct() | ||
| val diffA: DataFrame = diffHashA.join(distinctDiffA, Seq(HASH_COLUMN_NAME)) | ||
|
|
||
| logger.info("Getting diff rows for B") | ||
| val distinctDiffB: DataFrame = diffHashB.join(dfWithHashB, Seq(HASH_COLUMN_NAME)).distinct() | ||
| val diffB: DataFrame = diffHashB.join(distinctDiffB, Seq(HASH_COLUMN_NAME)) | ||
|
|
||
| (diffA, diffB) | ||
| } | ||
|
|
||
| } | ||
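The hash-based diff in `Comparator.compare` can be hard to follow through the Spark calls, so here is a minimal sketch of the same idea on plain Scala collections. It is not the PR's implementation: `Row`, `rowHash`, `diffRows`, and `HashDiffSketch` are hypothetical stand-ins, and `Seq.diff` (a multiset difference) is used in place of Spark's `exceptAll`. It also assumes no hash collisions between different rows.

```scala
object HashDiffSketch {
  // Hypothetical stand-in for a Spark row: column name -> value
  type Row = Map[String, String]

  // Stand-in for HashUtils.createHashColumn: hash the columns in a stable (sorted) order
  def rowHash(row: Row): Int = row.toSeq.sortBy(_._1).hashCode()

  // Rows of `a` whose hash has no remaining counterpart in `b`, duplicates respected
  def diffRows(a: Seq[Row], b: Seq[Row]): Seq[Row] = {
    // Seq.diff is a multiset difference, which is the role exceptAll plays in the PR
    val unmatchedHashes = a.map(rowHash).diff(b.map(rowHash))
    // One representative row per hash (the role of distinct() in the PR)
    val rowByHash = a.map(r => rowHash(r) -> r).toMap
    // Map every unmatched hash back to a whole row (the role of the final join)
    unmatchedHashes.map(rowByHash)
  }

  // Mirrors the shape of compare: exclude columns first, then diff in both directions
  def compare(a: Seq[Row], b: Seq[Row], excluded: Set[String] = Set.empty): (Seq[Row], Seq[Row]) = {
    val filteredA = a.map(_.filter { case (column, _) => !excluded(column) })
    val filteredB = b.map(_.filter { case (column, _) => !excluded(column) })
    (diffRows(filteredA, filteredB), diffRows(filteredB, filteredA))
  }
}
```

With a row that appears twice in A and once in B, one copy survives in the A-side diff, which is the reason for using a multiset difference rather than a set difference.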
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| /** Copyright 2020 ABSA Group Limited | ||
|
I'm going to write it here so that we can discuss it in one place: I like the split and the overall structure; it makes it pretty obvious that one module is a CLI tool and the other is meant to be an importable Scala library. However, naming (one of the difficulties of software engineering, especially in the pre-AI era) should be improved, in my opinion. The module What about this? Rename:
So:
|
||
| * | ||
| * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
| * an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations under the License. | ||
| */ | ||
|
|
||
| package za.co.absa.analysis | ||
|
|
||
| /** Result of row-by-row analysis with explicit outcome types. | ||
| */ | ||
| sealed trait AnalysisResult | ||
|
|
||
| object AnalysisResult { | ||
|
|
||
| /** Analysis completed successfully with row-by-row differences. | ||
| * | ||
| * @param diffAToB | ||
| * Differences from dataset A to B | ||
| * @param diffBToA | ||
| * Differences from dataset B to A | ||
| */ | ||
| case class Success(diffAToB: Seq[RowsDiff], diffBToA: Seq[RowsDiff]) extends AnalysisResult | ||
|
|
||
| /** Datasets are identical - no differences to analyze. | ||
| */ | ||
| case object DatasetsIdentical extends AnalysisResult | ||
|
|
||
| /** One dataset has differences but the other doesn't - cannot perform row-by-row matching. | ||
| * | ||
| * @param diffCountA | ||
| * Number of differences in dataset A | ||
| * @param diffCountB | ||
| * Number of differences in dataset B | ||
| */ | ||
| case class OneSidedDifference(diffCountA: Long, diffCountB: Long) extends AnalysisResult | ||
|
|
||
| /** Number of differences exceeds the threshold. | ||
| * | ||
| * @param diffCountA | ||
| * Number of differences in dataset A | ||
| * @param diffCountB | ||
| * Number of differences in dataset B | ||
| * @param threshold | ||
| * The threshold that was exceeded | ||
| */ | ||
| case class ThresholdExceeded(diffCountA: Long, diffCountB: Long, threshold: Int) extends AnalysisResult | ||
| } | ||
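Because `AnalysisResult` is a sealed trait, a caller can pattern match on it and have the compiler check that every outcome is handled. The sketch below illustrates that; it repeats the four outcomes so it is self-contained, and `RowsDiff` here is a hypothetical placeholder (the real type is defined elsewhere in the PR), as are `describe` and the wrapper object `AnalysisResultSketch`.

```scala
object AnalysisResultSketch {
  // Hypothetical placeholder; the real RowsDiff is not shown in this part of the diff
  final case class RowsDiff(description: String)

  sealed trait AnalysisResult
  object AnalysisResult {
    final case class Success(diffAToB: Seq[RowsDiff], diffBToA: Seq[RowsDiff]) extends AnalysisResult
    case object DatasetsIdentical extends AnalysisResult
    final case class OneSidedDifference(diffCountA: Long, diffCountB: Long) extends AnalysisResult
    final case class ThresholdExceeded(diffCountA: Long, diffCountB: Long, threshold: Int) extends AnalysisResult
  }

  import AnalysisResult._

  // Exhaustive match: removing a case makes the compiler warn about the missing outcome
  def describe(result: AnalysisResult): String = result match {
    case Success(aToB, bToA) =>
      s"analysed ${aToB.size} A-to-B and ${bToA.size} B-to-A differences"
    case DatasetsIdentical =>
      "datasets are identical"
    case OneSidedDifference(countA, countB) =>
      s"one-sided difference (A: $countA, B: $countB)"
    case ThresholdExceeded(countA, countB, threshold) =>
      s"too many differences (A: $countA, B: $countB, threshold: $threshold)"
  }
}
```

Modelling the outcomes as explicit types, rather than an `Option` or an exception, keeps the "nothing to analyse" and "too much to analyse" cases visible at every call site.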
These packages must be changed. If we release this to Maven Central, there won't be a single, clean package containing the dataset comparison library; it will be scattered among other ABSA OSS packages. I propose the following, at a high level, for pretty much all of the files (I am not a big fan of `analysis` either; it is too broad a word): `package za.co.absa.datasetcomparison` or `package za.co.absa.dataset.comparison`. Do this for both the lib and the CLI part. Then, under the lib part, it could be:
`package za.co.absa.datasetcomparison.core`
and under CLI:
`package za.co.absa.datasetcomparison.cli`
Further, though this is really optional, it would make sense to split the interfaces and the overall structure into sub-packages: `model` or `api`, and `diff`, and `metrics`, so splitting interfaces, metrics, and diff logic. The CLI part is pretty OK in terms of its sub-packages.
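As an illustration only, the package names from this comment could be laid out as below, using Scala's nested packaging syntax so the whole tree fits in one file. Only the package names come from the comment; which type goes where, and the names `MetricsPlaceholder` and `CliPlaceholder`, are assumptions made for the sketch. In a real build each package would normally be its own directory of source files.

```scala
// Hypothetical layout: only the package names are taken from the review comment
package za.co.absa.datasetcomparison {

  // Library part
  package core {
    // Interfaces and result types (the comment suggests `model` or `api`)
    package model { sealed trait AnalysisResult }

    // Diff logic; placing Comparator here is an assumption
    package diff { object Comparator }

    // Metrics; placeholder name, not a type from the PR
    package metrics { object MetricsPlaceholder }
  }

  // CLI part; placeholder name, not a type from the PR
  package cli { object CliPlaceholder }
}
```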
So at the end of the day, something like this: