Based on the idea of Spotify: a concrete example to understand how graph databases work, with Neo4j. The challenge is to create a music recommendation algorithm, using a very large database of songs (the Million Song Dataset) with a graphical interface (Symfony).
Take the time to read what is written here, so you understand what you are doing.
A dataset is just a bunch of data in a certain form; we have to convert it to match what Neo4j expects to ingest.
First, download the song dataset, then extract it:
$> wget http://static.echonest.com/millionsongsubset_full.tar.gz
$> tar -xvzf millionsongsubset_full.tar.gz
This will create a ./MillionSongSubset directory.
This resource gives us the list of song titles, along with many (many) other data points, such as the bitrate of each track or the artists related to the one who created it.
Each song's data is stored in its own .h5 file.
This HDF5 file format stores data as a tree of groups and datasets, much like a tree of folders and files.
Roughly speaking, these files contain batches of data in tables (as in Excel).
You can use the"HDFView" software to see what these files actually contain.
The dataset downloaded previously stores its .h5 files under multiple nested directories.
We copy everything into a single directory to make the following scripts easier to run:
$> find . -name "*.h5" -exec cp {} ../tools/DATASET_PROCESS/H5_FILES/ \;
The dataset provides the data for exactly 10,000 songs.
To make sure everything is there, execute the following; it should print 10000:
$> ls ./tools/DATASET_PROCESS/H5_FILES/ | wc -l
A Python script lets us extract the data we want from the dataset (title, related artists, play time, etc.).
Execute h5_to_ascii.sh to run the script that translates the .h5 files into human-readable ASCII files.
$> sh ./tools/h5_to_ascii.sh
Everything will be stored under ./tools/DATASET_PROCESS/ASCII_FILES/.
Neo4j can import CSV files. But since the first script only outputs plain ASCII text files, we have to convert them to JSON and then to CSV.
$> sh ./tools/ascii_to_json.sh
Everything will be stored under ./tools/DATASET_PROCESS/JSON_FILES/.
This script also concatenates the JSON files into a single file (./tools/DATASET_PROCESS/JSON_FILES/ALL_DATA_JSON.json) so we can easily convert it to CSV.
To convert the JSON we've produced into CSV, we use an excellent website:
https://codebeautify.org/json-to-csv
Click the "Browse" button, select ./ALL_DATA_JSON.json and click "Download".
Don't forget to put the resulting file on your server under the name ALL_DATA_CSV.csv.
Grab the files we've compiled for you:
./data/processed/artists_ids.csv
./data/processed/genres.csv
We ran into several problems while importing the CSV data with our first algorithm. To get a cleaner and slimmer import, we first create a dedicated node for each artist ID, so that each music can later be linked to its similar artists. The same goes for the genres.
Here are the steps to get the artist IDs:
Inside the downloaded dataset directory (by default ./MillionSongSubset/) is a file named subset_artist_term.db.
This file is an SQLite database.
We simply browsed this database with a SQLite browser and used its "export" feature, selecting only the artist_id column.
You can follow this official tutorial to install Neo4j on your Debian machine.
Inside /etc/neo4j/neo4j.conf:
# Uncomment:
dbms.security.auth_enabled=false
dbms.security.allow_csv_import_from_file_urls=true
# Comment:
#dbms.directories.import=/var/lib/neo4j/import
Restart Neo4j :
$> service neo4j restart
All the queries below are written in Cypher, Neo4j's query language: Cypher is to Neo4j what SQL is to MySQL.
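For example, where SQL matches rows in tables, Cypher matches patterns of nodes and relationships. Here is a quick taste (the artist_id value is made up; Artist and Music are the node labels we will create below):
// Find every music owned by a given artist.
MATCH (a:Artist {artist_id: "AR000001"})-[:OWNS]->(m:Music)
RETURN m.title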
Access your Neo4j browser instance at the following link, replacing localhost with your server's IP address if necessary.
http://localhost:7474/browser
Before continuing, we have to raise Neo4j's display limit of 300 nodes, using the command below in the Neo4j console.
Replace 1000 with the number you want, but be careful: a high value may make your browser crash.
:config initialNodeDisplay: 1000
Replace /home/user with the absolute path where you've cloned this Git repository.
LOAD CSV WITH HEADERS FROM "file:/home/user/Neo4j-Example-Spotifylike/data/processed/artists_id.csv" AS csvLine
CREATE (a:Artist { artist_id: csvLine.artist_id })
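As a quick optional sanity check (not required by the rest of the tutorial), you can count the Artist nodes that were just created:
// The result should match the number of lines in artists_id.csv.
MATCH (a:Artist) RETURN count(a)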
Replace /home/user with the absolute path where you've cloned this Git repository.
LOAD CSV WITH HEADERS FROM "file:/home/user/Neo4j-Example-Spotifylike/data/processed/genres.csv" AS csvLine
CREATE (g:Genre { name: csvLine.mbtag })
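Optional, but since the queries below repeatedly MATCH on artist_id and on genre names, creating indexes first should speed up the big import considerably (Neo4j 3.x syntax; run each statement separately in the console):
// Index the properties the upcoming MATCH clauses look up.
CREATE INDEX ON :Artist(artist_id)
CREATE INDEX ON :Genre(name)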
To link the artists to their genres, we will use the ./data/processed/artist_genre.csv file.
Replace /home/user with the absolute path where you've cloned this Git repository.
LOAD CSV WITH HEADERS FROM "file:/artist_genre.csv" AS csvLine
MATCH (a:Artist {artist_id:csvLine.artist_id}), (g:Genre {name: csvLine.mbtag})
MERGE (a)-[:HAS_GENRE]->(g)
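You can eyeball the result by sampling a few of the relations we just created:
// Show five artist–genre pairs.
MATCH (a:Artist)-[:HAS_GENRE]->(g:Genre)
RETURN a.artist_id, g.name
LIMIT 5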
Importing the song list is not as simple as the previous steps: each music comes with a "similar_artists" property, which is really heavy and would overload our server for no reason.
To avoid this, we reuse the Artist nodes and model everything as relations instead: an artist OWNS a music, and an artist is SIMILAR_TO another artist.
LOAD CSV WITH HEADERS FROM "file:/ALL_DATA_CSV.csv" AS csvLine
// Creating the music node.
MERGE (m:Music {title: csvLine.title, duration: csvLine.duration})
WITH m, csvLine
MATCH (a:Artist {artist_id: csvLine.artist_id})
MERGE (a)-[:OWNS]->(m)
SET a.name = csvLine.artist_name
MERGE (y:Year {year: csvLine.year})
MERGE (m)-[:RELEASED_IN]->(y)
MERGE (al:Album {name: csvLine.album})
MERGE (m)-[:IN]->(al)
MERGE (a)-[:CREATED]->(al)
WITH a, m, csvLine
UNWIND split(csvLine.all_terms, ',') as genre_instance
MATCH (g:Genre {name: genre_instance})
MERGE (m)-[:HAS_GENRE]->(g)
WITH a, m, csvLine
UNWIND split(csvLine.similar_artists, ',') as asi
MATCH (as:Artist {artist_id: asi})
MERGE (a)-[:SIMILAR_TO]->(as)
RETURN count(*)
// LIMIT 5; // Limit the query if your computer is not really powerful.
ℹ️ You might experience some problems while importing such a large quantity of data.
Add the following clause at the very beginning of the previous query to make it work. It will commit the data every 50 rows processed.
USING PERIODIC COMMIT 50
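For example, the artist import from earlier would become:
USING PERIODIC COMMIT 50
LOAD CSV WITH HEADERS FROM "file:/home/user/Neo4j-Example-Spotifylike/data/processed/artists_id.csv" AS csvLine
CREATE (a:Artist { artist_id: csvLine.artist_id })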
ℹ️ You might run into memory errors while importing the data. In /etc/neo4j/neo4j.conf, uncomment and adjust the following line:
dbms.memory.heap.max_size=1024m
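Once everything is imported, the graph is ready for recommendation queries. As a minimal sketch (not the project's actual algorithm; the song title is a placeholder), you can suggest musics owned by artists similar to the one who owns a music the user likes:
// Hypothetical example: recommend musics from artists similar
// to the artist of "Some Song".
MATCH (:Music {title: "Some Song"})<-[:OWNS]-(a:Artist)
MATCH (a)-[:SIMILAR_TO]->(other:Artist)-[:OWNS]->(rec:Music)
RETURN DISTINCT rec.title
LIMIT 10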
