FAQ

Questions and Answers

Where can I find data sets for practice?

At CloudxLab, the datasets are stored in two places: HDFS (Hadoop Distributed File System) and the local file system.

HDFS

In HDFS, the datasets are located under /data. You can list them with the following command:

hadoop fs -ls /data

Below is the list of datasets available in HDFS:

  • Numbers from 1 to 100000 /data/1lac
  • Chicago Crimes Data /data/Chicago_Crimes-2001-to-present.csv
  • New York Stock Exchange /data/NYSE_daily and /data/NYSE_dividends
  • Twitter tweets for IronMan movie /data/SentimentFiles/SentimentFiles/tweets_raw_full.zip
  • Amazon Reviews /data/amazon_review_full_csv
  • Loan Stats /data/clean_loan_stats_3c.csv
  • Common Phrases /data/common_phrases_pg10681.txt
  • files_baby_yahoo_yelp /data/contrib/case_study_files_baby_yahoo_yelp
  • files_batting /data/contrib/case_study_files_batting
  • files_elt /data/contrib/case_study_files_elt
  • files_fire /data/contrib/case_study_files_fire
  • files_hive_hdfs_to_ORC /data/contrib/case_study_files_hive_hdfs_to_ORC
  • files_hive_partition /data/contrib/case_study_files_hive_partition
  • files_hive_realEstate /data/contrib/case_study_files_hive_realEstate
  • files_hive_serde /data/contrib/case_study_files_hive_serde
  • files_hive_yellowTaxi /data/contrib/case_study_files_hive_yellowTaxi
  • files_iot /data/contrib/case_study_files_iot
  • files_pig_flight /data/contrib/case_study_files_pig_flight
  • files_survey /data/contrib/case_study_files_survey
  • sparksql_hbase /data/contrib/case_study_sparksql_hbase
  • sparksql_hive /data/contrib/case_study_sparksql_hive
  • sparkstreaming_kafka /data/contrib/case_study_sparkstreaming_kafka
  • Tennis Events /data/events_tennis.csv
  • google-10000-english /data/google-10000-english
  • Loan Stats /data/loan_stats_3c.csv
  • Location Geocode /data/location_geocode.csv
  • Log Messages /data/log_messages
  • MovieLens - 100k records /data/ml-100k
  • MovieLens - 1M records /data/ml-1m
  • Crypto Transactions Data /data/msprojects
  • person_wine_3 /data/person_wine_3.txt
  • points_tennis /data/points_tennis.csv
  • Tennis Rallies /data/rallies_tennis.csv
  • sales-drivers-timesheets /data/sales-drivers-timesheets
  • users_links_1000_kv /data/users_links_1000_kv.tsv
  • Zomato /data/zomato.csv
  • Apache access logs from NASA Kennedy Space Center /data/spark/project/NASA_access_log_Aug95.gz
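
To look inside one of these files without copying it out of HDFS, you can stream it and keep only the first few lines. The snippet below is a minimal sketch: hdfs_head is a hypothetical helper name (not a Hadoop command), and it assumes the hadoop CLI is on your PATH, as it should be in a lab console. It relies only on the standard hadoop fs -cat subcommand.

```shell
# hdfs_head is a hypothetical helper, not part of Hadoop.
# Usage: hdfs_head <hdfs-path> [number-of-lines]
hdfs_head() {
  src="$1"
  lines="${2:-5}"   # default to 5 lines when no count is given
  if ! command -v hadoop >/dev/null 2>&1; then
    echo "hadoop CLI not found; run this from a machine with HDFS access" >&2
    return 127
  fi
  # -cat streams the file to stdout; head stops reading after the requested lines.
  hadoop fs -cat "$src" | head -n "$lines"
}

# Example with a path taken from the list above.
hdfs_head /data/events_tennis.csv 5 || echo "preview failed"
```

To work on a file with local tools instead, hadoop fs -get followed by the HDFS path and a local destination copies it to the local file system.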

Local File System

The datasets are available under /cxldata. You can check by running ls /cxldata.

Here is a brief listing. Though the names are self-explanatory, feel free to poke around the datasets.

  • /cxldata/datasets/bootml/Facebook_metrices_1
  • /cxldata/datasets/bootml/Absenteeism_dataset_1
  • /cxldata/datasets/bootml/Forest_Fires_Data_Set
  • /cxldata/datasets/bootml/Bikes_Data_1
  • /cxldata/datasets/bootml/Wine_quality_dataset_1
  • /cxldata/datasets/bootml/Housing_California_1
  • /cxldata/datasets/bootml/Melbourne_land_dataset_1
  • /cxldata/datasets/bootml/kernel_performance_Data_Set__1
  • /cxldata/datasets/bootml/Protein_dataset_1
  • /cxldata/datasets/bootml/Student_performance_1
  • /cxldata/datasets/project/mnist
  • /cxldata/datasets/project/global-wheat-detection
  • /cxldata/datasets/project/ny_stock_prediction
  • /cxldata/datasets/project/rain-gauge
  • /cxldata/datasets/project/fashion-mnist
  • /cxldata/datasets/project/housing
  • /cxldata/datasets/project/cat-non-cat
  • /cxldata/datasets/project/titanic
  • /cxldata/dlcourse/imdb_reviews
  • /cxldata/embedding/glove/
  • /cxldata/embedding/word2vec/GoogleNews-vectors-negative300.bin
  • /cxldata/findata/mstf.csv
  • /cxldata/gle/expedia_train.csv
  • /cxldata/gle/usersha1-profile.tsv
  • /cxldata/gle/usersha1-artmbid-artname-plays.tsv
  • /cxldata/pet_mle/user_data.csv
  • /cxldata/pet_mle/car_data_2.csv
  • /cxldata/pet_mle/wine_quality.csv
  • /cxldata/pet_mle/time_series_data.csv
  • /cxldata/pet_mle/car_data_1.csv
  • /cxldata/pet_mle/stock_fundamentals.csv
  • /cxldata/projects/predict-future-sales
  • /cxldata/projects/image-class
  • /cxldata/projects/yolov4
  • /cxldata/projects/lookalikeceleb
  • /cxldata/py/mbox-short.txt
  • /cxldata/r/wages_education.csv
  • /cxldata/skin_disease_1

If you need to work with a particular dataset and would like us to upload it to the lab, please let us know at reachus@cloudxlab.com. If it is useful for more than 5% of users, we will be happy to upload it.