| 1 | +--- |
| 2 | +title: Mastering APIs for Data Collection |
| 3 | +sidebar_label: APIs |
| 4 | +description: "A deep dive into REST and GraphQL APIs: how to fetch, authenticate, and process external data for machine learning." |
| 5 | +tags: [apis, rest, graphql, json, data-engineering, python-requests] |
| 6 | +--- |
| 7 | + |
| 8 | +In the Data Engineering lifecycle, **APIs** are the "clean" way to collect data. Unlike web scraping, which is brittle and unstructured, APIs provide a contract-based method to access data that is versioned, documented, and usually delivered in machine-readable formats like JSON. |
| 9 | + |
| 10 | +## 1. How APIs Work: The Request-Response Cycle |
| 11 | + |
| 12 | +An API acts as a middleman between your ML pipeline and a remote server. You send a **Request** (a specific question) and receive a **Response** (the data answer). |
| 13 | + |
| 14 | +```mermaid |
| 15 | +sequenceDiagram |
| 16 | + participant Pipeline as ML Data Pipeline |
| 17 | + participant API as API Gateway |
| 18 | + participant Server as Data Server |
| 19 | + |
| 20 | + Pipeline->>API: HTTP Request (GET /data) |
| 21 | + Note right of Pipeline: Includes Headers & API Key |
| 22 | + API->>Server: Validate & Route |
| 23 | + Server-->>API: Data Payload |
| 24 | + API-->>Pipeline: HTTP Response (200 OK + JSON) |
| 25 | + |
| 26 | +``` |
| 27 | + |
| 28 | +### Components of an API Request: |
| 29 | + |
| 30 | +1. **Endpoint (URL):** The address where the data lives (e.g., `api.twitter.com/v2/tweets`). |
| 31 | +2. **Method:** What you want to do (`GET` to fetch, `POST` to send). |
| 32 | +3. **Headers:** Metadata such as your **API Key** or the response format you want (`Accept: application/json`). `Content-Type` describes the body you send, not the one you receive. |
| 33 | +4. **Parameters:** Filters for the data (e.g., `?start_date=2023-01-01`). |
| 34 | + |
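The four components above can be assembled and inspected without sending anything over the network. The following is a minimal sketch using only the standard library; the endpoint, the token, and the `limit` parameter are hypothetical placeholders, not part of any real API.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and credentials, used only for illustration.
base_url = "https://api.example.com/v1/records"
params = {"start_date": "2023-01-01", "limit": 100}
headers = {
    "Authorization": "Bearer YOUR_TOKEN",  # identifies the caller
    "Accept": "application/json",          # response format requested
}


def build_request(base_url, params, headers, method="GET"):
    """Assemble endpoint, method, headers, and parameters without sending."""
    full_url = f"{base_url}?{urlencode(params)}"
    return Request(full_url, headers=headers, method=method)


request = build_request(base_url, params, headers)
print(request.get_method())    # the HTTP method
print(request.full_url)        # endpoint plus encoded query parameters
print(request.header_items())  # headers as (name, value) pairs
```

Building the request as an object first makes it easy to log or unit-test exactly what a pipeline is about to send.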
| 35 | +## 2. Common API Architectures in ML |
| 36 | + |
| 37 | +### A. REST (Representational State Transfer) |
| 38 | + |
| 39 | +The most common architecture. It treats every piece of data as a "Resource" identified by a URL and manipulated with standard HTTP methods. |
| 40 | + |
| 41 | +* **Best for:** Standardized data fetching. |
| 42 | +* **Format:** Usually **JSON**, although REST itself does not mandate a format. |
| 43 | + |
| 44 | +### B. GraphQL |
| 45 | + |
| 46 | +Developed at Facebook (now Meta), it allows the client to define the structure of the data it needs. |
| 47 | + |
| 48 | +* **Advantage in ML:** If a user profile has 100 fields but you only need 3 features for your model, GraphQL prevents "Over-fetching," saving bandwidth and memory. |
| 49 | + |
| 50 | +[Image comparing REST vs GraphQL data fetching efficiency] |
| 51 | + |
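To make the over-fetching point concrete, here is a small sketch of a GraphQL request body that asks for only three fields. The `user` type and the field names `age`, `country`, and `purchaseCount` are assumptions invented for this example; a real schema will differ.

```python
import json

# Hypothetical schema: a `user` type with many fields, of which we need three.
query = """
query UserFeatures($id: ID!) {
  user(id: $id) {
    age
    country
    purchaseCount
  }
}
"""


def build_graphql_payload(query, variables):
    """Return the JSON body conventionally POSTed to a GraphQL endpoint."""
    return json.dumps({"query": query, "variables": variables})


payload = build_graphql_payload(query, {"id": "42"})
print(payload)
```

The server's response mirrors the shape of the query, so the pipeline receives only the three requested features rather than the full profile.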
| 52 | +### C. Streaming APIs (WebSockets/gRPC) |
| 53 | + |
| 54 | +Used when data needs to be delivered in real time: updates flow over a long-lived connection instead of through repeated polling. |
| 55 | + |
| 56 | +* **ML Use Case:** Algorithmic trading or live social media sentiment monitoring. |
| 57 | + |
| 58 | +## 3. Implementation in Python |
| 59 | + |
| 60 | +The third-party `requests` library is the de facto standard tool for calling HTTP APIs from Python. |
| 61 | + |
| 62 | +```python |
| 63 | +import requests |
| 64 | + |
| 65 | +url = "https://api.example.com/v1/weather" |
| 66 | +headers = { |
| 67 | +    "Authorization": "Bearer YOUR_TOKEN" |
| 68 | +} |
| 69 | +params = { |
| 70 | +    "city": "Mandsaur", |
| 71 | +    "country": "IN", |
| 72 | +    "units": "metric" |
| 73 | +} |
| 74 | +# requests has no default timeout, so set one explicitly |
| 75 | +response = requests.get(url, headers=headers, params=params, timeout=10) |
| 76 | + |
| 77 | +if response.status_code == 200: |
| 78 | +    data = response.json() |
| 79 | +    temperature = data["main"]["temp"]  # Key names depend on the provider's schema |
| 80 | +    humidity = data["main"]["humidity"]  # Same nested "main" object |
| 81 | + |
| 82 | +    print(f"Temperature in Mandsaur: {temperature}°C") |
| 83 | +    print(f"Humidity: {humidity}%") |
| 84 | +else: |
| 85 | +    print(f"Failed to fetch weather data (HTTP {response.status_code})") |
| 86 | + |
| 87 | +``` |
| 88 | + |
| 89 | +## 4. Challenges: Rate Limiting and Status Codes |
| 90 | + |
| 91 | +APIs are not infinite resources. Providers implement **Rate Limiting**, a cap on how many requests a client may send in a given time window, to prevent abuse. |
| 92 | + |
| 93 | +| Status Code | Meaning | Action for ML Pipeline | |
| 94 | +| --- | --- | --- | |
| 95 | +| **200** | OK | Process the data. | |
| 96 | +| **401** | Unauthorized | Check your API Key/Token. | |
| 97 | +| **404** | Not Found | Check your Endpoint URL. | |
| 98 | +| **429** | Too Many Requests | **Exponential Backoff:** Wait, retry, and double the wait after each further 429. Honor the `Retry-After` header if the provider sends one. | |
| 99 | + |
| 100 | +```mermaid |
| 101 | +flowchart TD |
| 102 | + Req[Send API Request] --> Res{Status Code?} |
| 103 | + Res -- 200 --> Save[Ingest to Database] |
| 104 | + Res -- 429 --> Wait[Wait/Sleep] --> Req |
| 105 | + Res -- 401 --> Fail[Alert Developer] |
| 106 | + style Wait fill:#fff3e0,stroke:#ef6c00,color:#333 |
| 107 | + |
| 108 | +``` |
| 109 | + |
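The wait-and-retry loop in the flowchart can be sketched as follows. This is one possible implementation under stated assumptions, not a prescribed one: `fetch_with_backoff` and its parameters are hypothetical names, and `send_request` stands in for whatever function actually calls the API and returns a `(status_code, body)` pair. Unlike the flowchart, the sketch caps the number of retries so a pipeline cannot loop forever.

```python
import random
import time


def fetch_with_backoff(send_request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send_request() until it stops returning 429 or retries run out.

    send_request is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = send_request()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Delay doubles each attempt (base, 2*base, 4*base, ...) plus small jitter.
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
        sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} retries")


# Usage with a fake API that rejects the first two calls (no network needed).
responses = iter([(429, None), (429, None), (200, {"ok": True})])
waits = []
status, body = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(status, body, [round(w) for w in waits])
```

Passing `sleep` as a parameter keeps the function testable: the usage example records the delays instead of actually waiting. The jitter spreads out retries when many workers hit the limit at the same moment.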
| 110 | +## 5. Authentication Methods |
| 111 | + |
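| 112 | +1. **API Keys:** A simple string passed in a header or, less safely, in a query parameter. |
| 113 | +2. **OAuth 2.0:** A token-based authorization framework used by Google, Meta, and Twitter. Access tokens are typically short-lived, which limits the damage if one leaks. |
| 114 | +3. **JWT (JSON Web Tokens):** Signed tokens that carry their own claims, often used between internal microservices. |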
| 115 | + |
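In practice the three methods mostly differ in which header carries the credential. The sketch below illustrates this with the standard library only. The header name `X-API-Key` and the environment variable `EXAMPLE_API_KEY` are assumptions chosen for the example (providers pick their own names), and `jwt_claims` only decodes a token for inspection; it does not verify the signature.

```python
import base64
import json
import os


def api_key_headers(key, header_name="X-API-Key"):
    """API key auth. The header name varies by provider; X-API-Key is an assumption."""
    return {header_name: key}


def bearer_headers(token):
    """OAuth 2.0 access tokens and JWTs are typically sent as Bearer tokens."""
    return {"Authorization": f"Bearer {token}"}


def jwt_claims(token):
    """Decode (NOT verify) the payload segment of a JWT to inspect its claims."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


# Read secrets from the environment rather than hard-coding them.
api_key = os.environ.get("EXAMPLE_API_KEY", "demo-key")
print(api_key_headers(api_key))


# A throwaway, unsigned token built here purely to exercise jwt_claims.
def _b64(obj):
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


demo_token = f'{_b64({"alg": "none"})}.{_b64({"sub": "pipeline-1"})}.'
print(bearer_headers(demo_token)["Authorization"][:7])
print(jwt_claims(demo_token))
```

Either header dictionary can be passed as the `headers` argument of an HTTP client call. A production service should validate JWT signatures with a dedicated library rather than trusting decoded claims.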
| 116 | +## References for More Details |
| 117 | + |
| 118 | +* **[REST API Tutorial](https://restfulapi.net/):** Understanding the principles of RESTful design. |
| 119 | + |
| 120 | + |
| 121 | +* **[Python Requests Guide](https://requests.readthedocs.io/en/latest/):** Mastering HTTP requests for data collection. |
| 122 | + |
| 123 | +--- |
| 124 | + |
| 125 | +APIs give us structured data, but sometimes the "front door" is locked. When there is no API, we must use the more aggressive "side window" approach. |