Machine Learning Prerequisites (Numpy)

NumPy - Arrays - Loading text file data using NumPy's genfromtxt() function

As we discussed earlier, there are two ways (constructs) in NumPy to load data from a text file:

(1) using loadtxt() function

(2) using genfromtxt() function

Below is an example of using genfromtxt() function

Example of genfromtxt()

The genfromtxt() function is very helpful when you expect some missing values in the dataset being loaded. Below is some sample code:

import numpy as np
my_arr = np.genfromtxt('my_file.txt', skip_header=2, filling_values=9999999)

By default, genfromtxt() loads the values as floating-point numbers. Any field that is empty or cannot be converted to a number (for example, a string value in an otherwise numeric dataset) is treated as missing and, unless you say otherwise, is returned as nan.

If you want missing values to be replaced with a value other than nan, specify that value in the filling_values parameter. For example, the code above replaces any missing value with 9999999.
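To see the effect of filling_values without needing a file on disk, here is a minimal sketch. The in-memory text is a made-up stand-in for 'my_file.txt' (two header lines, then two data rows with one empty field), and the comma delimiter is an assumption of this example.

```python
import io
import numpy as np

# Hypothetical in-memory data standing in for 'my_file.txt':
# two header lines, then rows in which one field is empty (missing).
text = "sample data\na,b,c\n1,2,3\n4,,6\n"

# Without filling_values, the empty field is returned as nan.
default_arr = np.genfromtxt(io.StringIO(text), delimiter=",", skip_header=2)

# With filling_values, the empty field is replaced by the given value.
filled_arr = np.genfromtxt(io.StringIO(text), delimiter=",", skip_header=2,
                           filling_values=9999999)

print(default_arr)
print(filled_arr)
```

The only difference between the two printed arrays should be the middle value of the second row.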

The genfromtxt() function can also trim white space around the values being loaded if you pass autostrip=True.

You can also limit the number of rows to load with the max_rows parameter; in that case, at most that many rows are loaded.
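The two options mentioned above can be tried out as follows. This is only a sketch on made-up in-memory data; autostrip and max_rows are genfromtxt() keyword arguments, while the variable names are chosen here for illustration.

```python
import io
import numpy as np

# Hypothetical string data with extra white space around some values.
data = "1, abc , 2\n 3, xxx, 4"

# Without autostrip, the surrounding white space is kept in string fields.
raw = np.genfromtxt(io.StringIO(data), delimiter=",", dtype="U5")

# With autostrip=True, white space around each value is trimmed.
stripped = np.genfromtxt(io.StringIO(data), delimiter=",", dtype="U5",
                         autostrip=True)
print(raw)
print(stripped)

# Hypothetical numeric data with four rows; load only the first two.
numbers = "1,2,3\n4,5,6\n7,8,9\n10,11,12\n"
first_two = np.genfromtxt(io.StringIO(numbers), delimiter=",", max_rows=2)
print(first_two.shape)
```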


Please follow the below steps:

(1) Please import the required libraries

import numpy as np
import os

(2) Please create a variable HOUSING_PATH and assign to it, as a string, the path of the directory that contains the housing.csv file ('/cxldata/datasets/project/housing')

HOUSING_PATH = <<your code comes here>>

(3) Please define a complete path for your csv file housing.csv by using os.path.join() function, by passing to it the HOUSING_PATH and the csv file housing.csv, and save this complete path in a variable FILE.

FILE = os.path.join(HOUSING_PATH, <<your code comes here>>)

(4) Please define a function load_housing_dataset() that takes the complete path of the csv file (FILE, defined above) as the default value of its file parameter. This function will load the housing.csv file using the genfromtxt() function.

def <<your code comes here>>(file=FILE):
    return np.genfromtxt(
        file,
        dtype={'names': ('longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
                         'population', 'households', 'median_income', 'median_house_value', 'ocean_proximity'),
               'formats': ('f8', 'f8', 'f8', 'f8', 'f8', 'f8', 'f8', 'f8', 'f8', '|S15')},
        delimiter=',', skip_header=1, filling_values=99999999, unpack=False)

genfromtxt() function parameters:

first parameter - name of the file from which the data is to be loaded.

second parameter - dtype. Data type of the columns of the loaded csv file housing.csv. It is a Python dictionary with two keys: 'names', which holds the column names, and 'formats', which holds the data types of the respective columns, e.g. f8, |S15, etc.

'f8' means a 64-bit floating-point number.

'|S15' means a byte string of at most 15 characters.

third parameter - delimiter. Character by which values in a row of our csv file are separated. For example, in our case values of a row of our csv file housing.csv are separated by ',' (comma)

fourth parameter - skip_header. The number of lines to skip at the beginning of the csv file. E.g. you may want to skip the first row of this csv file, as it contains header information that you do not want to load as data.

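fifth parameter - filling_values. The value that replaces missing values, as explained earlier on this page.

sixth parameter - unpack. Same meaning as explained in the loadtxt() function chapter: if True, the returned array is transposed so that each column can be unpacked into its own variable.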
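To make the dtype dictionary more concrete, here is a small sketch on made-up data with just two columns. The column names price and city and the in-memory text are assumptions for illustration only; they are not part of housing.csv.

```python
import io
import numpy as np

# Hypothetical two-column data: a header row, then two records.
# The price of the second record is empty (missing).
text = "price,city\n452600.0,NEAR BAY\n,INLAND\n"

arr = np.genfromtxt(
    io.StringIO(text),
    # 'names' lists the column names, 'formats' the matching data types.
    dtype={'names': ('price', 'city'), 'formats': ('f8', '|S15')},
    delimiter=',',
    skip_header=1,            # skip the header row
    filling_values=99999999,  # replaces the missing price
)

print(arr.dtype.names)  # the column names taken from the dtype
print(arr['price'])     # one column, selected by name
print(arr['city'])      # '|S15' columns hold byte strings
```

Because the dtype has named fields, the result is a one-dimensional structured array with one record per row, and each column can be selected by its name.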

(5) Call the load_housing_dataset() function, as defined above, and store the output in a variable called result_arr

result_arr = <<your code comes here>>()

(6) Print the length (number of records) of result_arr

print(<<your code comes here>>)

(7) Print array result_arr to see its values.

print(<<your code comes here>>)
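The same sequence of steps (build a path, define a loader, call it, inspect the result) can be tried on a throwaway file first. The sketch below is not the solution to the exercise: the temporary directory, the file example.csv, its contents, and the function name load_example() are all made up for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for a dataset directory and csv file.
with tempfile.TemporaryDirectory() as data_dir:
    # Build the complete path from the directory and the file name.
    file_path = os.path.join(data_dir, "example.csv")
    with open(file_path, "w") as f:
        f.write("x,y\n1.0,2.0\n3.0,\n5.0,6.0\n")

    def load_example(file=file_path):
        # Skip the header row and replace missing values with -1.0.
        return np.genfromtxt(file, delimiter=",", skip_header=1,
                             filling_values=-1.0)

    result = load_example()

print(len(result))  # number of records (rows)
print(result)
```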
