9. Data mining

Processing the data from a large file and finding patterns in it is known as data-mining. Data mining required lots of data cleaning and data transformation operations. In this section, we will see some of these operations.

9.1. Building data structure from file

In the previous chapters, we read the data from the file and then process the data. In this section, we will save the data in a list/dictionary and then use these data structure to process the data. We will see that the data processing operations become easier when the data is converted in the form of dictionary.

Important

In previous chapter, we read the file and calculated the total prices inside the function ‘ring_cost’. The problem with this method is that the read data is no longer available (after the return statement) for the further processing.

Therefore, it is good idea to save the results in a list or dictionary, so that it will be available for other functions as well, as shown in this chapter.

Lets see the contents of the ‘price.csv’ file again,

$ cat price.csv
date,metal,radius,price,quantity
"2016-06-12","Gold",5.5,80.99,1
"2015-07-13","Silver",40.3,5.5,3
"2016-01-21","Iron",9.2,14.29,8
"2014-03-23","Gold",8,120.3,2
"2017-09-11","Copper",4.1,70.25,12
"2011-01-20","Iron",3.25,10.99,3

9.1.1. Save and read data in list

Create a new file ‘datamine.py’ with following contents. These contents are same as Listing 8.1, except it returns the list. Also, a check is applied for input ‘mode’ value.

# datamine.py

import csv

def read_file(filename, mode='warn'):
    ''' read csv file and save data in the list '''

    # check for correct mode
    if mode not in ['warn', 'silent', 'stop']:
        raise ValueError("possible modes are 'warn', 'silent', 'stop'")

    ring_data = [] # create empty list to save data

    with open (filename, 'r') as f:
        rows =  csv.reader(f)
        header = next(rows) # skip the header

        # change the types of the columns
        for row in rows:
            try:
                row[2] = float(row[2]) # radius
                row[3] = float(row[3]) # price
                row[4] = int(row[4]) # quantity
            except ValueError as err: # process value error only
                if mode == 'warn':
                    print("Invalid data, row is skipped")
                    print('Row: {}, Reason : {}'.format(row_num, err))
                elif mode == 'silent':
                    pass # do nothing
                elif mode == 'stop':
                    raise # raise the exception
                continue

            # append data in list in the form of tuple
            ring_data.append(tuple(row))

    return ring_data

def main():
    ring_data = read_file('price.csv')

    # total rows in the file
    print("Total rows: ", len(ring_data))

    # total price calculation
    total_price = 0
    for row in ring_data:
        total_price += row[3] * row[4]
    print("Total price: {:10.2f}".format(total_price))

if __name__ == '__main__':
    main()

Run the above code and we will get the following results,

$ python datamine.py
Total rows:  6
Total price:    1328.38

Now, open the Python shell and run the below code. See the difference, “Previously we returned total_price from the function, therefore we could perform no more operation on the data. But, now we have the data in the form of List, therefore we can perform operation on the data.

   >>> from datamine import read_file
   >>> ring_data = read_file('price.csv')
   >>> len(ring_data)
   6
   >>> ring_data[0]
   ('2016-06-12', 'Gold', 5.5, 80.99, 1)

   >>> for data in ring_data: # print metal with radius > 9
   ...     if data[2] > 9:
   ...             print("Metal: {0}, Radius: {1}".format(data[1], data[2]))
   ...
   Metal: Silver, Radius: 40.3
   Metal: Iron, Radius: 9.2


.. `data_in_dict`:

9.1.2. Save and read data in Dictionary

In the previous section, the list is read and data is printed (i.e. name of metal when radius > 9). It worked fine there, but when we have a large number of columns in the list, then it is very difficult to locate the elements using positions e.g. ‘data[2]’. For easy referencing, a dictionary can be used as shown below.

Note that, at line 58, the elements are located by the name, i.e. row[‘price’], which is easier to handle than using index e.g. row[3].

# datamine.py

import csv

def read_file(filename, mode='warn'):
    ''' read csv file and save data in the list '''

    # check for correct mode
    if mode not in ['warn', 'silent', 'stop']:
        raise ValueError("possible modes are 'warn', 'silent', 'stop'")

    ring_data = [] # create empty list to save data

    with open (filename, 'r') as f:
        rows =  csv.reader(f)
        header = next(rows) # skip the header

        # change the types of the columns
        for row in rows:
            try:
                row[2] = float(row[2]) # radius
                row[3] = float(row[3]) # price
                row[4] = int(row[4]) # quantity
            except ValueError as err: # process value error only
                if mode == 'warn':
                    print("Invalid data, row is skipped")
                    print('Row: {}, Reason : {}'.format(row_num, err))
                elif mode == 'silent':
                    pass # do nothing
                elif mode == 'stop':
                    raise # raise the exception
                continue

            # ring_data.append(tuple(row))

            # append data in list in the form of tuple
            row_dict = {
                    'date' : row[0],
                    'metal' : row[1],
                    'radius' : row[2],
                    'price' : row[3],
                    'quantity' : row[4]
                }

            ring_data.append(row_dict)

    return ring_data

def main():
    ring_data = read_file('price.csv')

    # total rows in the file
    print("Total rows: ", len(ring_data))

    # total price calculation
    total_price = 0
    for row in ring_data:
        total_price += row['price'] * row['quantity']
    print("Total price: {:10.2f}".format(total_price))

if __name__ == '__main__':
    main()

Following is the output of above code,

$ python datamine.py
Total rows:  6
Total price:    1328.38

9.2. List comprehension

In previous section, we read the data from the file and stored in the list/dictionary to perform further operations.

In this section, we will extract a specific type of data and store them in a new list. Let’s do it in the Python shell as below,

9.2.1. Basic method for extraction

In the below code, if-statement along with the ‘loop’ is used to extract the desired data, i.e. radius < 5.

>>> from datamine import read_file
>>> ring_data = read_file('price.csv')
>>> small_ring = []
>>> for ring in ring_data:
...     if ring['radius'] < 5: # store radius < 5
...         small_ring.append((ring['metal'], ring['radius'],  ring['price']))
...
>>> for ring in small_ring: # display content of small_ring
...     print(ring)
...
('Copper', 4.1, 70.25)
('Iron', 3.25, 10.99)

9.2.2. List comprehension for extraction

Operation in above section, i.e. if statement with loop, is very common, therefore Python provide a way to do it in one line, which is known as ‘list comprehension’, as shown below,

>>> from datamine import read_file
>>> ring_data = read_file('price.csv')
>>> small_ring = []
>>> small_ring = [(ring['metal'], ring['radius'], ring['price'])
...     for ring in ring_data if ring['radius'] < 5 ]
>>>
>>> for ring in small_ring:
...     print(ring)
...
('Copper', 4.1, 70.25)
('Iron', 3.25, 10.99)

9.2.3. Lambda operator

In numref:data_in_dict, the data was saved in the dictionary and then a specific type of data is extraced in above setion. In this section, we will sort the data store in the dictionary.

9.2.4. Basis method for sorting

Let’s do it in the Python shell as shown below. First see the content of the dictionary again,

>>> from datamine import read_file
>>> ring_data = read_file('price.csv')
>>> for data in ring_data:
...     print(data)
...
{'date': '2016-06-12', 'metal': 'Gold', 'radius': 5.5,
    'price': 80.99, 'quantity': 1}

{'date': '2015-07-13', 'metal': 'Silver', 'radius': 40.3,
    'price': 5.5, 'quantity': 3}

{'date': '2016-01-21', 'metal': 'Iron', 'radius': 9.2,
    'price': 14.29, 'quantity': 8}

{'date': '2014-03-23', 'metal': 'Gold', 'radius': 8.0,
    'price': 120.3, 'quantity': 2}

{'date': '2017-09-11', 'metal': 'Copper', 'radius': 4.1,
    'price': 70.25, 'quantity': 12}

{'date': '2011-01-20', 'metal': 'Iron', 'radius': 3.25,
    'price': 10.99, 'quantity': 3}

Note that, unlike list, we can not perform the sort() operation on dictionary. We will have following error,

>>> ring_data.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'dict' and 'dict'

To sort the data in dictionary, we need to provide a ‘key’ to the sort() method. The key can be extracted from the dictionary using a function. Then this function should be called using sort() method as shown below,

Note

sort() method has the ‘key’ argument i.e. it is not user-defined name.

>>> def sort_name(all_data):
...     return all_data['metal']
...
>>>
>>> ring_data.sort(key=sort_name)  # sort the data
>>> for data in ring_data:
...     print(data)
...
{'date': '2017-09-11', 'metal': 'Copper', 'radius': 4.1,
    'price': 70.25, 'quantity': 12}

{'date': '2016-06-12', 'metal': 'Gold', 'radius': 5.5,
    'price': 80.99, 'quantity': 1}

{'date': '2014-03-23', 'metal': 'Gold', 'radius': 8.0,
    'price': 120.3, 'quantity': 2}

{'date': '2016-01-21', 'metal': 'Iron', 'radius': 9.2,
    'price': 14.29, 'quantity': 8}

{'date': '2011-01-20', 'metal': 'Iron', 'radius': 3.25,
    'price': 10.99, 'quantity': 3}

{'date': '2015-07-13', 'metal': 'Silver', 'radius': 40.3,
    'price': 5.5, 'quantity': 3}

9.2.5. Lambda operator

The problem with above method is that we need to define a one line function for each key e.g. ‘metal’, ‘price’ and ‘radius’ etc. (if we want to sort on every key), which is not desirable coding-style.

Lambda operators are the one operators which can be used to replace the one line function. Let’s see some example of Lambda operator first,

>>> sq = lambda x : x**2  # one line function for x**2
>>> sq(3)
9

>>> sum2Num = lambda x, y : x + y # lambda operator with two variable
>>> sum2Num(3, 4)
7

Note

In ‘sq = lambda x : x**2’, the ‘x’ is the input argument to function and the value after ‘:’, i.e. x**2, is the return value. And ‘sq’ is the name of the function i.e. the statement is equivalent to below code,

def sq(x):
    return x**2

Now, we will use the sort() method using lambda operator as shown below,

>>> ring_data.sort(key=lambda all_data : all_data['metal'])
>>> for data in ring_data:
...     print(data)
...
{'date': '2017-09-11', 'metal': 'Copper', 'radius': 4.1,
    'price': 70.25, 'quantity': 12}

{'date': '2016-06-12', 'metal': 'Gold', 'radius': 5.5,
    'price': 80.99, 'quantity': 1}

{'date': '2014-03-23', 'metal': 'Gold', 'radius': 8.0,
    'price': 120.3, 'quantity': 2}

{'date': '2016-01-21', 'metal': 'Iron', 'radius': 9.2,
    'price': 14.29, 'quantity': 8}

{'date': '2011-01-20', 'metal': 'Iron', 'radius': 3.25,
    'price': 10.99, 'quantity': 3}

{'date': '2015-07-13', 'metal': 'Silver', 'radius': 40.3,
    'price': 5.5, 'quantity': 3}

9.3. Find and arrange Gold rings

Let’s add List comprehension and Lambda operator in the file ‘datamine.py’. In the below code, the Gold rings are extracted first; and then the rings are arranged in decreasing order according to radius.

Listing 9.1 Find and arrange Gold rings
 # datamine.py

 import csv

 def read_file(filename, mode='warn'):
     ''' read csv file and save data in the list '''

     # check for correct mode
     if mode not in ['warn', 'silent', 'stop']:
         raise ValueError("possible modes are 'warn', 'silent', 'stop'")

     ring_data = [] # create empty list to save data

     with open (filename, 'r') as f:
         rows =  csv.reader(f)
         header = next(rows) # skip the header

         # change the types of the columns
         for row in rows:
             try:
                 row[2] = float(row[2]) # radius
                 row[3] = float(row[3]) # price
                 row[4] = int(row[4]) # quantity
             except ValueError as err: # process value error only
                 if mode == 'warn':
                     print("Invalid data, row is skipped")
                     print('Row: {}, Reason : {}'.format(row_num, err))
                 elif mode == 'silent':
                     pass # do nothing
                 elif mode == 'stop':
                     raise # raise the exception
                 continue

             # ring_data.append(tuple(row))

             # append data in list in the form of tuple
             row_dict = {
                     'date' : row[0],
                     'metal' : row[1],
                     'radius' : row[2],
                     'price' : row[3],
                     'quantity' : row[4]
                 }

             ring_data.append(row_dict)

     return ring_data

 def main():
     ring_data = read_file('price.csv')

     # # total rows in the file
     # print("Total rows: ", len(ring_data))

     # # total price calculation
     # total_price = 0
     # for row in ring_data:
         # total_price += row['price'] * row['quantity']
     # print("Total price: {:10.2f}".format(total_price))

     # extract Gold-ring : using List comprehension
     gold_ring = [ring for ring in ring_data if ring['metal'] == 'Gold']
     for ring in gold_ring: # print metal and radius
         print("Metal: {0}, Radius: {1}".format(ring['metal'], ring['radius']))

     # reverse-sort the data in gold_ring : using Lambda operators
     gold_ring.sort(key=lambda data : data['radius'], reverse=True)
     print("\nRadius in descending order:")
     for ring in gold_ring: # print metal and radius
         print("Metal: {0}, Radius: {1}".format(ring['metal'], ring['radius']))

 if __name__ == '__main__':
     main()

Now execute the file and we will get the following results,

$ python datamine.py
Metal: Gold, Radius: 5.5
Metal: Gold, Radius: 8.0

Radius in descending order:
Metal: Gold, Radius: 8.0
Metal: Gold, Radius: 5.5

9.4. Conclusion

In this chapter, we store the data in the list and dictionary. Then we perform the extraction operation on the dictionary using ‘list comprehension’. Lastly we used the ‘lambda operator’ for sorting the data in the dictionary.