At CloudxLab, the datasets are located in two places: HDFS (Hadoop Distributed File System) and the local file system.
HDFS
In HDFS, the datasets are located under /data. You can use the following command to list them:
hadoop fs -ls /data
Below is the list of datasets available in HDFS:
- Numbers from 1 to 100000 /data/1lac
- Chicago Crimes Data /data/Chicago_Crimes-2001-to-present.csv
- New York Stock Exchange /data/NYSE_daily and /data/NYSE_dividends
- Tweets about the Iron Man movie /data/SentimentFiles/SentimentFiles/tweets_raw_full.zip
- Amazon Reviews /data/amazon_review_full_csv
- Loan Stats /data/clean_loan_stats_3c.csv
- Common Phrases /data/common_phrases_pg10681.txt
- files_baby_yahoo_yelp /data/contrib/case_study_files_baby_yahoo_yelp
- files_batting /data/contrib/case_study_files_batting
- files_elt /data/contrib/case_study_files_elt
- files_fire /data/contrib/case_study_files_fire
- files_hive_hdfs_to_ORC /data/contrib/case_study_files_hive_hdfs_to_ORC
- files_hive_partition /data/contrib/case_study_files_hive_partition
- files_hive_realEstate /data/contrib/case_study_files_hive_realEstate
- files_hive_serde /data/contrib/case_study_files_hive_serde
- files_hive_yellowTaxi /data/contrib/case_study_files_hive_yellowTaxi
- files_iot /data/contrib/case_study_files_iot
- files_pig_flight /data/contrib/case_study_files_pig_flight
- files_survey /data/contrib/case_study_files_survey
- sparksql_hbase /data/contrib/case_study_sparksql_hbase
- sparksql_hive /data/contrib/case_study_sparksql_hive
- sparkstreaming_kafka /data/contrib/case_study_sparkstreaming_kafka
- Tennis Events /data/events_tennis.csv
- google-10000-english /data/google-10000-english
- Loan Stats /data/loan_stats_3c.csv
- Location Geocode /data/location_geocode.csv
- Log Messages /data/log_messages
- MovieLens - 100k records /data/ml-100k
- MovieLens - 1M records /data/ml-1m
- Crypto Transactions Data /data/msprojects
- person_wine_3 /data/person_wine_3.txt
- points_tennis /data/points_tennis.csv
- Tennis Rallies /data/rallies_tennis.csv
- sales-drivers-timesheets /data/sales-drivers-timesheets
- users_links_1000_kv /data/users_links_1000_kv.tsv
- Zomato /data/zomato.csv
- Apache access logs from NASA Kennedy Space Center /data/spark/project/NASA_access_log_Aug95.gz
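To look inside one of these datasets without copying it out of HDFS, you can stream the file and keep only its first few lines. The helper below is a minimal sketch under stated assumptions: preview_hdfs is a hypothetical name (not a Hadoop or CloudxLab command), the hadoop client is assumed to be on your PATH as it is on the lab console, and the path is assumed to point to a single plain-text file rather than a directory or an archive.

```shell
# preview_hdfs is a hypothetical helper, not part of Hadoop or CloudxLab.
# Usage: preview_hdfs <hdfs-file> [lines]
preview_hdfs() {
  file="$1"
  lines="${2:-5}"   # default to the first 5 lines

  # Assumption: the hadoop client is installed, as on the lab console.
  if ! command -v hadoop >/dev/null 2>&1; then
    echo "hadoop client not found" >&2
    return 1
  fi

  # Stream the file from HDFS and stop after the requested number of lines.
  hadoop fs -cat "$file" | head -n "$lines"
}

# Example (run on the lab console; assumes this path is a single text file):
#   preview_hdfs /data/common_phrases_pg10681.txt 10
```

For entries that turn out to be directories, run hadoop fs -ls on the path first and pass one of the files it lists.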
Local File System
The datasets are available under /cxldata. You can check this using ls /cxldata.
Here is a brief listing. Though the names are self-explanatory, feel free to poke around the datasets.
- /cxldata/datasets/bootml/Facebook_metrices_1
- /cxldata/datasets/bootml/Absenteeism_dataset_1
- /cxldata/datasets/bootml/Forest_Fires_Data_Set
- /cxldata/datasets/bootml/Bikes_Data_1
- /cxldata/datasets/bootml/Wine_quality_dataset_1
- /cxldata/datasets/bootml/Housing_California_1
- /cxldata/datasets/bootml/Melbourne_land_dataset_1
- /cxldata/datasets/bootml/kernel_performance_Data_Set__1
- /cxldata/datasets/bootml/Protein_dataset_1
- /cxldata/datasets/bootml/Student_performance_1
- /cxldata/datasets/project/mnist
- /cxldata/datasets/project/global-wheat-detection
- /cxldata/datasets/project/ny_stock_prediction
- /cxldata/datasets/project/rain-gauge
- /cxldata/datasets/project/fashion-mnist
- /cxldata/datasets/project/housing
- /cxldata/datasets/project/cat-non-cat
- /cxldata/datasets/project/titanic
- /cxldata/dlcourse/imdb_reviews
- /cxldata/embedding/glove/
- /cxldata/embedding/word2vec/GoogleNews-vectors-negative300.bin
- /cxldata/findata/mstf.csv
- /cxldata/gle/expedia_train.csv
- /cxldata/gle/usersha1-profile.tsv
- /cxldata/gle/usersha1-artmbid-artname-plays.tsv
- /cxldata/pet_mle/user_data.csv
- /cxldata/pet_mle/car_data_2.csv
- /cxldata/pet_mle/wine_quality.csv
- /cxldata/pet_mle/time_series_data.csv
- /cxldata/pet_mle/car_data_1.csv
- /cxldata/pet_mle/stock_fundamentals.csv
- /cxldata/projects/predict-future-sales
- /cxldata/projects/image-class
- /cxldata/projects/yolov4
- /cxldata/projects/lookalikeceleb
- /cxldata/py/mbox-short.txt
- /cxldata/r/wages_education.csv
- /cxldata/skin_disease_1
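A quick way to poke around a local dataset is to check its size and look at its first few lines. This is a minimal sketch: peek_local is a hypothetical helper name, and it assumes the argument is a plain-text file such as a CSV, not a directory or a binary file.

```shell
# peek_local is a hypothetical helper for inspecting a local text file.
# Usage: peek_local <file> [lines]
peek_local() {
  file="$1"
  lines="${2:-5}"   # default to the first 5 lines

  # Refuse directories and missing paths; list those with ls instead.
  if [ ! -f "$file" ]; then
    echo "not a regular file: $file" >&2
    return 1
  fi

  ls -lh "$file"            # size, owner and permissions
  head -n "$lines" "$file"  # first few lines, e.g. a CSV header and rows
}

# Example (run on the lab console):
#   peek_local /cxldata/pet_mle/wine_quality.csv
```

Many of the entries above are directories, so run ls on them first and pass one of the files inside to the helper.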
If you need to work with a particular dataset and would like us to upload it to the lab, please let us know at reachus@cloudxlab.com. If it would be useful to more than 5% of users, we would be glad to upload it.