Yandex.Cloud, S3, HDFS, Hive, MapReduce, Tez, YARN, HiveSQL, CLI, Shell, Hadoop Cluster Administration
Task: provide constant access to cold data, design a star schema, and build a showcase (data mart) of the following form:
| Payment type | Date | Tips average amount | Passengers total |
|---|---|---|---|
| Cash | 2020-01-31 | 999.99 | 112 |

Sample row of the source data:

`1,2020-04-01 00:41:22,2020-04-01 01:01:53,1,1.20,1,N,41,24,2,5.5,0.5,0.5,0,0,0.3,6.8,0`

Learn more about the data source here
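
The sample row above has 18 comma-separated fields. A minimal sketch of an external table over such files might look like the following. It assumes the fields follow the NYC TLC yellow taxi trip record layout (an assumption based on the shape of the row, not stated in this document). The database name `taxi`, the table name `trips_csv`, the bucket path, and the presence of a header line in the files are all hypothetical.

```sql
-- Hypothetical names: database "taxi", table "trips_csv", bucket path.
CREATE DATABASE IF NOT EXISTS taxi;

-- EXTERNAL: dropping the table leaves the files in the bucket untouched.
-- Column names are assumed from the NYC TLC yellow taxi data dictionary.
CREATE EXTERNAL TABLE IF NOT EXISTS taxi.trips_csv (
    vendor_id              INT,
    tpep_pickup_datetime   TIMESTAMP,
    tpep_dropoff_datetime  TIMESTAMP,
    passenger_count        INT,
    trip_distance          DOUBLE,
    ratecode_id            INT,
    store_and_fwd_flag     STRING,
    pu_location_id         INT,
    do_location_id         INT,
    payment_type           INT,
    fare_amount            DOUBLE,
    extra                  DOUBLE,
    mta_tax                DOUBLE,
    tip_amount             DOUBLE,
    tolls_amount           DOUBLE,
    improvement_surcharge  DOUBLE,
    total_amount           DOUBLE,
    congestion_surcharge   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- Replace the placeholder with the bucket the data was copied to.
LOCATION 's3a://<your-bucket>/trips/'
-- Assumes each CSV file starts with one header line; drop this if not.
TBLPROPERTIES ('skip.header.line.count' = '1');
```

With this mapping, the sample row would read as payment type `2`, one passenger, and a tip amount of `0`.
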
- Deploying a Hadoop cluster using a `Yandex.Cloud` solution.
- Creating a bucket using the `S3` Yandex.Cloud solution.
- Copying the data (database) to the created `s3` bucket using `distcp`.
- Creating and configuring the database. Setting the `Hive - TEZ` configuration.
- Creating the dimension table "payment" according to the description of the data format. The storage format is `parquet`.
- The fields are named `id` and `name`. Filling the dimension table.
- Using the access utility `Hive CLI`.
- Tables are created as external (`external`) to prevent data loss.
- Creating the `trips` tables on top of the existing data in `csv` format. `trips` is partitioned by the day the trip started, and the storage format is `parquet`, so a query filtered by date reads only the matching partitions.
- Configuring partitions, transforming the data, and loading it into the fact tables.
- Creating the data showcase using a materialized view and `MAPJOIN`.
- Creating a terminal script for automatic showcase creation.
- Rebuilding the showcase.
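
The table, load, and showcase steps above could be sketched in HiveQL roughly as follows. This is an illustration under stated assumptions, not the project's actual script. It assumes Hive 3 or later (for materialized views) and an existing external CSV table, here called `taxi.trips_csv`, with columns `tpep_pickup_datetime`, `passenger_count`, `payment_type`, and `tip_amount`. The database name, table names, showcase name, and bucket paths are hypothetical, and the payment type names are assumed from the NYC TLC data dictionary.

```sql
-- Run on Tez and allow partitions to be created from query results.
SET hive.execution.engine=tez;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- MAPJOIN hints are ignored by default; this makes Hive honor them.
SET hive.ignore.mapjoin.hint=false;

-- Dimension table: payment type id and name (values are assumed).
CREATE EXTERNAL TABLE IF NOT EXISTS taxi.payment (
    id   INT,
    name STRING
)
STORED AS PARQUET
LOCATION 's3a://<your-bucket>/dim/payment/';

-- Note: INSERT INTO appends, so run this only once.
INSERT INTO TABLE taxi.payment VALUES
    (1, 'Credit card'), (2, 'Cash'), (3, 'No charge'),
    (4, 'Dispute'), (5, 'Unknown'), (6, 'Voided trip');

-- Fact table: Parquet, one partition per trip start day.
CREATE EXTERNAL TABLE IF NOT EXISTS taxi.trips (
    payment_type    INT,
    passenger_count INT,
    tip_amount      DOUBLE
)
PARTITIONED BY (trip_date STRING)
STORED AS PARQUET
LOCATION 's3a://<your-bucket>/fact/trips/';

-- Transform and load: the partition column must come last in the SELECT.
INSERT OVERWRITE TABLE taxi.trips PARTITION (trip_date)
SELECT payment_type,
       passenger_count,
       tip_amount,
       CAST(to_date(tpep_pickup_datetime) AS STRING) AS trip_date
FROM taxi.trips_csv
WHERE tpep_pickup_datetime IS NOT NULL;

-- Showcase. DISABLE REWRITE because the source tables are external
-- (non-transactional), which rules out automatic query rewriting.
CREATE MATERIALIZED VIEW IF NOT EXISTS taxi.showcase
DISABLE REWRITE
STORED AS PARQUET
AS
SELECT /*+ MAPJOIN(p) */
       p.name                       AS payment_type,
       t.trip_date                  AS trip_date,
       ROUND(AVG(t.tip_amount), 2)  AS tips_average_amount,
       SUM(t.passenger_count)       AS passengers_total
FROM taxi.trips t
JOIN taxi.payment p ON t.payment_type = p.id
GROUP BY p.name, t.trip_date;

-- Rebuild the showcase after new data has been loaded.
ALTER MATERIALIZED VIEW taxi.showcase REBUILD;
```

Saved to a file (for example a hypothetical `showcase.hql`), the statements can be run from a terminal with `hive -f showcase.hql`, which is one way to automate showcase creation. Even without the hint, Hive usually converts a join against a table as small as `payment` into a map join when `hive.auto.convert.join` is enabled.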