---
title: "JSON: The Semi-Structured Standard"
sidebar_label: JSON
description: "Mastering JSON for Machine Learning: handling nested data, converting dictionaries, and efficient parsing for NLP pipelines."
tags: [data-engineering, json, api, semi-structured-data, python, nlp]
---

**JSON (JavaScript Object Notation)** is a lightweight, text-based format for storing and transporting data. While CSVs are perfect for simple tables, JSON excels at representing **hierarchical** or **nested** data—where one observation might contain lists or other sub-observations.

## 1. JSON Syntax vs. Python Dictionaries

JSON structure is almost identical to a Python dictionary. It uses key-value pairs and supports several data types:

* **Objects:** Enclosed in `{}` (Maps to Python `dict`).
* **Arrays:** Enclosed in `[]` (Maps to Python `list`).
* **Values:** Strings, Numbers, Booleans (`true`/`false`), and `null`.

```json
{
  "user_id": 101,
  "metadata": {
    "login_count": 5,
    "tags": ["premium", "active"]
  },
  "is_active": true
}
```

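To make the mapping concrete, here is a minimal round-trip sketch using Python's standard `json` module (`json.loads` parses text into a `dict`, `json.dumps` serializes it back; the variable names are illustrative):

```python
import json

# Parse JSON text into a Python dict (deserialization)
raw = '{"user_id": 101, "is_active": true, "tags": ["premium", "active"]}'
record = json.loads(raw)
print(record["tags"])          # ['premium', 'active'] -- JSON `true` became Python True

# Serialize the dict back into a JSON string (serialization)
print(json.dumps(record, indent=2))
```
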
## 2. Why JSON is Critical for ML

### A. Natural Language Processing (NLP)

Text data often comes with complex metadata (author, timestamp, geolocation, and nested entity tags). JSON allows all this info to stay bundled with the raw text.

### B. Configuration Files

Most ML frameworks use JSON (or its cousin, YAML) to store **Hyperparameters**.

```json
{
  "model": "ResNet-50",
  "learning_rate": 0.001,
  "optimizer": "Adam"
}
```

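Loading such a config at training time is a single `json.load` call; a minimal sketch (the filename `config.json` is an assumption for illustration):

```python
import json

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

print(config["learning_rate"])  # 0.001
```
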
### C. API Responses

As discussed in the [APIs section](/tutorial/machine-learning/data-engineering-basics/data-collection/apis), almost every web service returns data in JSON format.

## 3. The "Flattening" Problem

Machine Learning models (like Linear Regression or XGBoost) require **flat** 2D arrays (Rows and Columns). They cannot "see" inside a nested JSON object. Data engineers must **Flatten** or **Normalize** the data.

```mermaid
graph LR
    Nested[Nested JSON] --> Normalize["pd.json_normalize()"]
    Normalize --> Flat[Flat DataFrame]
    style Normalize fill:#f3e5f5,stroke:#7b1fa2,color:#333
```

**Example in Python:**

```python
import pandas as pd

raw_json = [
    {"name": "Alice", "info": {"age": 25, "city": "NY"}},
    {"name": "Bob", "info": {"age": 30, "city": "SF"}}
]

# Flattens 'info' into 'info.age' and 'info.city' columns
df = pd.json_normalize(raw_json)
print(df.columns.tolist())  # ['name', 'info.age', 'info.city']
```

## 4. Performance Trade-offs

| Feature | JSON | CSV | Parquet |
| --- | --- | --- | --- |
| **Flexibility** | **Very High** (Schema-less) | Low (Fixed Columns) | Medium (Evolving Schema) |
| **Parsing Speed** | Slow (Heavy string parsing) | Medium | **Very Fast** |
| **File Size** | Large (Repeated Keys) | Medium | Small (Binary) |

:::note
In a JSON file, the key (e.g., `"user_id"`) is repeated for every single record, which wastes a lot of disk space compared to CSV.
:::

## 5. JSONL: The Big Data Variant

Standard JSON files require you to load the entire file into memory to parse them. For datasets with millions of records, we use **JSONL (JSON Lines)**.

* Each line in the file is a separate, valid JSON object.
* **Benefit:** You can stream the file line-by-line without crashing your RAM.

```text
{"id": 1, "text": "Hello world"}
{"id": 2, "text": "Machine Learning is fun"}
```

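A minimal streaming sketch (the filename `corpus.jsonl` is illustrative): each iteration parses a single line, so memory use stays flat no matter how large the file is.

```python
import json

texts = []
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:                 # reads one line at a time
        record = json.loads(line)  # one small dict in memory, not the whole file
        texts.append(record["text"])
```
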
## 6. Best Practices for ML Engineers

1. **Validation:** Use JSON Schema to ensure the data you're ingesting hasn't changed structure.
2. **Encoding:** Always use `UTF-8` to avoid character corruption in text data.
3. **Compression:** Since JSON is text-heavy, always use `.gz` or `.zip` when storing raw JSON files; compression can save up to 90% of the space (see the sketch below).

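As a rough sketch of point 3, the standard `gzip` module can read and write compressed JSON directly; the filename and payload here are illustrative:

```python
import gzip
import json

records = [{"id": i, "text": "sample text"} for i in range(1000)]

# Write gzip-compressed JSON in one step ("wt" = text-mode write)
with gzip.open("records.json.gz", "wt", encoding="utf-8") as f:
    json.dump(records, f)

# Read it back transparently
with gzip.open("records.json.gz", "rt", encoding="utf-8") as f:
    restored = json.load(f)
```
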
## References for More Details

* **[Python `json` Module](https://docs.python.org/3/library/json.html):** Learning `json.loads()` and `json.dumps()`.
* **[Pandas `json_normalize` Guide](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html):** Mastering complex flattening of API data.

---

JSON is the king of flexibility, but for "Big Data" production environments where speed and storage are everything, we move to binary formats.
---
title: "Parquet: The Big Data Gold Standard"
sidebar_label: Parquet
description: "Understanding Columnar storage, compression benefits, and why Parquet is the preferred format for high-performance ML pipelines."
tags: [data-engineering, parquet, big-data, columnar-storage, performance, cloud-storage]
---

**Apache Parquet** is an open-source, column-oriented data file format designed for efficient data storage and retrieval. Unlike CSV or JSON, which store data row-by-row, Parquet organizes data by **columns**. This single architectural shift makes it the industry standard for modern data lakes and ML feature stores.

## 1. Row-based vs. Columnar Storage

To understand Parquet, you must understand the difference in how data is laid out on your hard drive.

* **Row-based (CSV/SQL):** Stores all data for "User 1," then all data for "User 2."
* **Columnar (Parquet):** Stores all "User IDs" together, then all "Ages" together, then all "Incomes" together.

```mermaid
graph LR
    subgraph Row_Storage [Row-Based: CSV]
        R1[Row 1: ID, Age, Income]
        R2[Row 2: ID, Age, Income]
    end

    subgraph Col_Storage [Column-Based: Parquet]
        C1[IDs: 1, 2, 3...]
        C2[Ages: 25, 30, 35...]
        C3[Incomes: 50k, 60k...]
    end
```

## 2. Why Parquet is Superior for ML

### A. Column Projection (Selective Reading)

In ML, you might have a dataset with 500 columns, but your specific model only needs 5 features.

* **CSV:** You must read the entire file into memory to get those 5 columns.
* **Parquet:** The system "jumps" directly to the 5 columns you need and skips the other 495. On a wide table like this, that can cut I/O by 90% or more.

### B. Drastic Compression

Because Parquet stores similar data types together, it can use highly efficient compression algorithms (like Snappy or Gzip).

* **Example:** In an "Age" column, numbers are similar. Parquet can store "30, 30, 30, 31" as "3x30, 1x31" (**Run-Length Encoding**). A quick size comparison follows below.

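Here is a minimal sketch of that size difference (the column values and filenames are illustrative; `to_parquet` needs `pyarrow` or `fastparquet` installed):

```python
import os
import pandas as pd

# A highly repetitive column: ideal for run-length encoding
df = pd.DataFrame({"age": [30] * 100_000 + [31] * 100_000})

df.to_csv("ages.csv", index=False)
df.to_parquet("ages.parquet", compression="snappy")

print(os.path.getsize("ages.csv"), os.path.getsize("ages.parquet"))
# Expect the Parquet file to be a small fraction of the CSV's size
```
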
### C. Schema Preservation

Parquet is a binary format that stores **metadata**. It "knows" that a column is a 64-bit float or a Timestamp. You never have to worry about a "Date" column being accidentally read as a string.

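You can inspect that embedded schema directly; a small sketch with PyArrow (`pq.read_schema` reads only the file footer, and the filename is illustrative):

```python
import pyarrow.parquet as pq

# Reads the schema from the footer metadata without loading any rows
schema = pq.read_schema("ages.parquet")
print(schema)  # e.g. "age: int64"
```
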
## 3. Parquet vs. CSV: The Benchmarks

| Feature | CSV | Parquet |
| --- | --- | --- |
| **Storage Size** | 1.0x (Large) | **~0.2x (Small)** |
| **Query Speed** | Slow | **Very Fast** |
| **Cost (Cloud)** | Expensive (S3 scans more data) | **Cheap** (S3 scans less data) |
| **ML Readiness** | Requires manual type casting | **Plug-and-play** |

## 4. Using Parquet in Python

Pandas and PyArrow make it easy to switch from CSV to Parquet.

```python
import pandas as pd

df = pd.DataFrame({"feature_1": [0.1, 0.2], "feature_2": [1.5, 2.5], "target": [0, 1]})

# Saving a dataframe to Parquet
# Requires 'pyarrow' or 'fastparquet' installed
df.to_parquet('large_dataset.parquet', compression='snappy')

# Reading only specific columns (The magic of Parquet!)
df_subset = pd.read_parquet('large_dataset.parquet', columns=['feature_1', 'target'])
```

## 5. When to use Parquet

1. **Production Pipelines:** Always use Parquet for data passed between different stages of a pipeline.
2. **Large Datasets:** Once your data grows beyond a few hundred MB, the speed gains become obvious.
3. **Cloud Storage:** If storing data in AWS S3 or Google Cloud Storage, Parquet will save you significant money on data egress/scan costs.

## References for More Details

* **[Apache Parquet Official Documentation](https://parquet.apache.org/):** Deep diving into the binary file structure.
* **[Databricks - Why Parquet?](https://www.databricks.com/glossary/what-is-parquet):** Understanding Parquet's role in the "Lakehouse" architecture.

---

Parquet is the king of analytical data storage. However, some streaming applications require a format that is optimized for high-speed row writes rather than column reads.
---
title: "XML: Extensible Markup Language"
sidebar_label: XML
description: "Handling hierarchical data in XML: parsing techniques, its role in Computer Vision annotations, and converting XML to ML-ready formats."
tags: [data-engineering, xml, data-formats, computer-vision, pascal-voc, web-services]
---

**XML** is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. While JSON has largely replaced XML for web APIs, XML remains a cornerstone in industrial systems and **Object Detection** datasets.

## 1. Anatomy of an XML Document

XML uses a tree-like structure consisting of **tags**, **attributes**, and **content**.

```xml
<annotation>
  <filename>image_01.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>cat</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>120</ymin>
      <xmax>250</xmax>
      <ymax>300</ymax>
    </bndbox>
  </object>
</annotation>
```

## 2. XML in Machine Learning: Use Cases

### A. Computer Vision (Pascal VOC)

One of the most famous datasets in ML history, **Pascal VOC**, uses XML files to store the coordinates of bounding boxes for image classification and detection.

### B. Enterprise Data Integration

Many older banking, insurance, and manufacturing systems exchange data exclusively via XML over SOAP (Simple Object Access Protocol).

### C. Configuration & Metadata

XML is often used to store metadata for scientific datasets where complex, nested relationships must be strictly defined by a **Schema (XSD)**.

## 3. Parsing XML in Python

Because XML is a tree, we don't read it like a flat file. We "traverse" the tree using libraries like `ElementTree` or `lxml`.

```python
import xml.etree.ElementTree as ET

tree = ET.parse('annotation.xml')
root = tree.getroot()

# Accessing specific data
filename = root.find('filename').text
for obj in root.findall('object'):
    name = obj.find('name').text
    print(f"Detected object: {name}")
```

## 4. XML vs. JSON

| Feature | XML | JSON |
| --- | --- | --- |
| **Metadata** | Supports Attributes + Elements | Only Key-Value pairs |
| **Strictness** | High (Requires XSD validation) | Low (Flexible) |
| **Size** | Verbose (Closing tags increase size) | Compact |
| **Readability** | High (Document-centric) | High (Data-centric) |

## 5. The Challenge: Deep Nesting

Just like [JSON](/tutorial/machine-learning/data-engineering-basics/data-formats/json), XML is hierarchical. To use it in a standard ML model (like a Random Forest), you must **Flatten** the tree into a table.

```mermaid
graph TD
    XML[XML Root] --> Branch1[Branch: Metadata]
    XML --> Branch2[Branch: Observations]
    Branch2 --> Leaf[Leaf: Data Point]
    Leaf --> Flatten[Flattening Logic]
    Flatten --> CSV[2D Feature Matrix]

    style XML fill:#f3e5f5,stroke:#7b1fa2,color:#333
    style CSV fill:#e1f5fe,stroke:#01579b,color:#333
```

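As a minimal sketch of that flattening step, here is one way to turn the Pascal VOC-style annotation from Section 1 into a flat table with one row per bounding box (the row-building logic is illustrative, not a standard API):

```python
import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('annotation.xml')
root = tree.getroot()
filename = root.find('filename').text

rows = []
for obj in root.findall('object'):
    box = obj.find('bndbox')
    rows.append({
        "filename": filename,
        "label": obj.find('name').text,
        "xmin": int(box.find('xmin').text),
        "ymin": int(box.find('ymin').text),
        "xmax": int(box.find('xmax').text),
        "ymax": int(box.find('ymax').text),
    })

df = pd.DataFrame(rows)  # a flat 2D feature matrix, one row per box
```
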
## 6. Best Practices

1. **Use `lxml` for Speed:** The built-in `ElementTree` is fine for small files, but `lxml` is significantly faster for processing large datasets.
2. **Beware of "XML Bombs":** Malicious XML files can use entity expansion to crash your parser (DoS attack). Use **defusedxml** if you are parsing untrusted data from the web (see the sketch below).
3. **Schema Validation:** Always validate your XML against an `.xsd` file if available to ensure your ML pipeline doesn't break due to a missing tag.

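A minimal sketch of point 2: `defusedxml` (installed via `pip install defusedxml`) provides a drop-in replacement for `ElementTree` that rejects entity-expansion attacks instead of executing them:

```python
import defusedxml.ElementTree as ET  # hardened drop-in replacement

# parse() raises an exception on malicious payloads (e.g. "billion laughs")
tree = ET.parse('untrusted_annotation.xml')
root = tree.getroot()
```
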
## References for More Details

* **[Python ElementTree Documentation](https://docs.python.org/3/library/xml.etree.elementtree.html):** Learning the standard library approach.
* **[Pascal VOC Dataset Format](http://host.robots.ox.ac.uk/pascal/VOC/):** Seeing how XML is used in real-world ML projects.

---

XML completes our look at "Text-Based" formats. While these are great for humans to read, they are slow for machines to process. Next, we look at the high-speed binary formats used in Big Data.
