Configuring the Stanford NER tagger on a Mac

The Stanford Named Entity Recognizer (NER tagger) is available through the NLTK library. It can be configured as follows:

    1. Download the required packages from the links below:
      Package               Download Link   Version
      Stanford Parser                       3.8.0
      Stanford NER                          3.8.0
      Stanford POS tagger                   3.8.0
    2. Create a new folder (StanfordParser) and extract the downloaded archives into this folder
    3. Set the path variable in the `~/.bash_profile` file and point it to the location of the StanfordParser folder

vim ~/.bash_profile
export PATH="$PATH:/usr/<>/StanfordParser"
source ~/.bash_profile

Test your path settings:
echo $PATH

    4. Finally, test your tagger:

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Paths to the Stanford NER classifier model and jar (adjust to your install location)
stanford_classifier = '/StanfordParser/stanford-ner-2015-12-09/classifiers/english.muc.7class.distsim.crf.ser.gz'
stanford_ner_path = '/StanfordParser/stanford-ner-2015-12-09/stanford-ner.jar'
st = StanfordNERTagger(stanford_classifier, stanford_ner_path)

text = "created sample text for Stanford parser"
# Split the text into tokens, then tag each token with its entity class
tokens = word_tokenize(text)
classified_text = st.tag(tokens)
print(classified_text)

The output of the snippet above:
[('created', 'O'), ('sample', 'O'), ('text', 'O'), ('for', 'O'), ('Stanford', 'ORGANIZATION'), ('parser', 'O')]
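
The tagger returns one (token, label) pair per token, so multi-word entities arrive as separate pairs. The following is a minimal sketch of one way to merge adjacent tokens that share a label into a single entity. The helper name `group_entities` is hypothetical and is not part of NLTK; it only post-processes the list shown above.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_label="O"):
    """Merge runs of adjacent tokens that share the same non-'O' label."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == outside_label:
            continue  # skip tokens that are not part of any entity
        entities.append((" ".join(token for token, _ in run), label))
    return entities


# The tagger output shown above
classified_text = [('created', 'O'), ('sample', 'O'), ('text', 'O'),
                   ('for', 'O'), ('Stanford', 'ORGANIZATION'), ('parser', 'O')]
print(group_entities(classified_text))  # [('Stanford', 'ORGANIZATION')]
```

Note that this simple grouping also merges two distinct entities of the same class when they are directly adjacent, which may or may not be what you want.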


Simple data preprocessing with Pandas

Quick Data Preprocessing with Pandas and SciKit

Data Preprocessing

Before applying ML algorithms, the data must be preprocessed and converted to a standard format. With Pandas, this takes a few simple steps:

In [36]:
#import all the necessary libraries 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
%matplotlib inline

Let’s load the famous IRIS dataset.
The IRIS dataset (csv format) has 4 attributes with no missing values and will be used for classification.

In [30]:
input_path = ''
iris_contents = pd.read_csv(input_path, header=None)
# Check the column attributes
iris_contents.columns
Out[30]:
Int64Index([0, 1, 2, 3, 4], dtype='int64')
In [31]:
# Set the column information 
iris_contents.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']

Check the first few lines of the data:

In [32]:
iris_contents.head()
Out[32]:
   Sepal Length  Sepal Width  Petal Length  Petal Width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Inputs and labels are stored in separate variables:

In [33]:
# Excluding the last column
inputs = iris_contents[iris_contents.columns[:-1]]
# Retaining only the last column for labels (.ix has been removed from pandas; .iloc selects by position)
labels = iris_contents.iloc[:, 4]
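
To make the split concrete, here is a minimal sketch of the same idea without pandas, using only the standard library. It assumes a comma-separated file with no header row and the class label in the last column; the sample rows are the five rows shown in the preview above, and the function name `split_inputs_labels` is hypothetical.

```python
import csv
import io

# First five rows of the Iris data, as shown in the preview above (no header row).
sample_csv = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
"""


def split_inputs_labels(csv_file):
    """Return (inputs, labels): numeric feature rows and the last-column labels."""
    inputs, labels = [], []
    for row in csv.reader(csv_file):
        if not row:
            continue  # ignore blank lines
        inputs.append([float(value) for value in row[:-1]])
        labels.append(row[-1])
    return inputs, labels


sample_inputs, sample_labels = split_inputs_labels(io.StringIO(sample_csv))
print(sample_inputs[0], sample_labels[0])  # [5.1, 3.5, 1.4, 0.2] Iris-setosa
```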

Feature Distribution

The Pandas “describe” function summarizes the distribution of each attribute in the dataset:

In [34]:
inputs.describe()
Out[34]:
       Sepal Length  Sepal Width  Petal Length  Petal Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Visualizing the distribution

The feature distribution can be visualized using a boxplot:

In [35]:
inputs.boxplot()

The Sepal Width attribute has some outliers, with several points above the upper whisker. Also, the medians of the attributes are not at the same level. Let’s standardize these features using the sklearn library.


Standardization rescales attributes with differing means and standard deviations so that each has zero mean and unit standard deviation. It shifts and scales the values but does not change the shape of each distribution.
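
As a rough illustration of what standardization computes, the sketch below applies the z-score formula (x - mean) / std to a handful of values using only the standard library. The function name `standardize` is hypothetical, the sample values are the first five Sepal Length entries from the data preview, and the population standard deviation is used because that is what sklearn's `scale` divides by.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(x - mean) / std for x in values]


# First five Sepal Length values from the data preview
sepal_length_sample = [5.1, 4.9, 4.7, 4.6, 5.0]
z_scores = standardize(sepal_length_sample)

# After standardization the mean is (numerically) 0 and the population std is 1
print(abs(fmean(z_scores)) < 1e-9, abs(pstdev(z_scores) - 1.0) < 1e-9)  # True True
```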

In [44]:
# scale
std_inputs = scale(inputs)
# numpy to pandas conversion
res_inputs = std_inputs.reshape((-1,4))
std_df = pd.DataFrame({'Sepal Length':res_inputs[:,0],'Sepal Width':res_inputs[:,1],'Petal Length':res_inputs[:,2], 
                      'Petal Width' : res_inputs[:,3]})

Let’s visualize the standardized data using a boxplot:

In [45]:
std_df.boxplot()

We can now see that the medians of Sepal Length and Sepal Width are almost at the same level, as are those of Petal Length and Petal Width. The standard deviation is now the same for all of these features, which can be verified by calling the “describe” or “std” function:

In [47]:
std_df.std()
Out[47]:
Petal Length    1.00335
Petal Width     1.00335
Sepal Length    1.00335
Sepal Width     1.00335
dtype: float64
In [49]:
std_df.median()
Out[49]:
Petal Length    0.336266
Petal Width     0.133226
Sepal Length   -0.052506
Sepal Width    -0.124958
dtype: float64
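
The reported standard deviation is 1.00335 rather than exactly 1. This is consistent with sklearn's `scale` dividing by the population standard deviation (dividing by n), while the pandas `std` function reports the sample standard deviation (dividing by n - 1) by default. For n = 150 rows, the ratio between the two is sqrt(150/149), as this small check shows:

```python
import math

n = 150  # number of rows reported by describe()

# Ratio of the sample std (n - 1 in the denominator) to the population std (n)
ratio = math.sqrt(n / (n - 1))
print(round(ratio, 5))  # 1.00335
```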


The steps we followed for data preprocessing:

  1. Extract the data
  2. Analyze the distribution
  3. Standardize the data

Now the data will be ready for feature extraction/selection and classification.

Pushing to a GitHub repository

Follow these simple steps to push your project to your GitHub repository.

Step 1: Make sure git is installed on your system.

Step 2: Once git is installed, set your user name and email by running the following commands in your command line

$ git config --global user.name "username"

$ git config --global user.email email

Step 3: Use the following command to enter the directory from which you want to push to your repo

$ cd path/localfolder

Step 4: Initialize git in this directory

$ git init

Step 5: Add all the files in this directory and commit them using the commands below

$ git add .

$ git commit

Step 6: Add a remote reference

$ git remote add origin yourproject/example.git

Step 7: Push all the changes to the new origin

$ git push origin master
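
If you repeat these steps often, they can be scripted. The sketch below only builds the command lines for Steps 4 to 7 and prints them; it does not run git. The function name `push_commands` and the commit message are hypothetical, and `git -C <dir>` is used in place of the `cd` in Step 3.

```python
import shlex


def push_commands(local_dir, remote_url, message="Initial commit", branch="master"):
    """Return the git commands for Steps 4-7 as argument lists."""
    return [
        ["git", "-C", local_dir, "init"],
        ["git", "-C", local_dir, "add", "."],
        # -m supplies the commit message inline (an assumption; plain `git commit` opens an editor)
        ["git", "-C", local_dir, "commit", "-m", message],
        ["git", "-C", local_dir, "remote", "add", "origin", remote_url],
        ["git", "-C", local_dir, "push", "origin", branch],
    ]


# Print the commands using the placeholder paths from the steps above
for command in push_commands("path/localfolder", "yourproject/example.git"):
    print(shlex.join(command))
```

To actually execute them, each list can be passed to `subprocess.run(command, check=True)`, which stops at the first failing step.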

That’s it. You are all set!

Hadoop : Single Node Installation

This blog describes how to set up and configure a Single Node Hadoop Installation on Ubuntu. Installation should take about 20 minutes in total.

Step 1: Install Java.

Step 2: Set up passphrase-less ssh

Check that you can ssh to the localhost without a passphrase

>> ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands

>> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

>> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys



Step 3: Create a directory for Hadoop, download a recent stable version from here, and extract it using the command below

>> tar zxvf hadoop-2.6.0.tar.gz


Step 4: Set the following environment variables in .bashrc and run the command ‘bash’ for your changes to take effect.

export JAVA_HOME=<Path to JAVA file>

export HADOOP_HOME=<Path to HADOOP folder>

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export PATH=$HADOOP_HOME/bin:$PATH
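
A quick way to confirm that the variables from Step 4 are visible in a new shell is a small check script. The following is a minimal sketch with hypothetical function names; it only reads the environment and changes nothing.

```python
import os

# Variables set in Step 4
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR")


def missing_hadoop_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def hadoop_bin_on_path(env=None):
    """Return True if $HADOOP_HOME/bin is one of the PATH entries."""
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME", "")
    if not hadoop_home:
        return False
    expected = os.path.join(hadoop_home, "bin")
    return expected in env.get("PATH", "").split(os.pathsep)


print("Missing variables:", missing_hadoop_vars())
print("Hadoop bin on PATH:", hadoop_bin_on_path())
```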


Step 5: In your Hadoop distribution go to etc/hadoop/ to define some parameters as follows

export JAVA_HOME={your JAVA home directory}

export HADOOP_PREFIX={your hadoop distribution directory}


Step 6: Run the command bin/hadoop to ensure that Hadoop is installed properly. This command will display the usage documentation.


Step 7: Create directories for namenode and datanode daemons. Make sure that they don’t share the same id.

>> mkdir -p hdfs/namenode

>> mkdir -p hdfs/datanode


Step 8: Go to etc/hadoop/core-site.xml. It should contain the following configuration:







Step 9: Go to etc/hadoop/hdfs-site.xml:









<description>Paths on the local filesystem for DataNode blocks.</description>

<description>Path on the local filesystem for the NameNode namespace and transaction logs.</description>
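
Only the two descriptions of this configuration are shown above. As a hedged illustration of the general shape of such a file, the sketch below builds a Hadoop-style configuration document with the standard library. The property names `dfs.datanode.data.dir` and `dfs.namenode.name.dir` are assumed from the descriptions, the `file:///path/to/...` values are placeholders for the directories created in Step 7, and the function name `build_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Build a <configuration> document from (name, value, description) tuples."""
    root = ET.Element("configuration")
    for name, value, description in properties:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        ET.SubElement(prop, "description").text = description
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed property names and placeholder paths; adjust to your own directories
hdfs_site_xml = build_site_xml([
    ("dfs.datanode.data.dir", "file:///path/to/hdfs/datanode",
     "Paths on the local filesystem for DataNode blocks."),
    ("dfs.namenode.name.dir", "file:///path/to/hdfs/namenode",
     "Path on the local filesystem for the NameNode namespace and transaction logs."),
])
print(hdfs_site_xml)
```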




Step 10: Go to etc/hadoop/yarn-site.xml









Step 11: Go to etc/hadoop/mapred-site.xml








Step 12: Format the namenode:

Do this only once, at install time; formatting again will erase the data

>> $HADOOP_HOME/bin/hdfs namenode -format


Step 13: Start the daemons




Step 14: The command “hadoop version” should display the Hadoop version, and the “jps” command should list all the running daemons.
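
Checking the “jps” listing by eye is easy to get wrong, so here is a minimal sketch that reports which daemons are missing from its output. The expected daemon names are an assumption (the set typically seen on a single-node Hadoop 2.x setup with YARN), the function name `missing_daemons` is hypothetical, and the sample text with its process ids is made up for illustration, not real output.

```python
# Assumed daemon names for a single-node setup with YARN
EXPECTED_DAEMONS = {"NameNode", "DataNode", "SecondaryNameNode",
                    "ResourceManager", "NodeManager"}


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that do not appear in the jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # each line looks like "<pid> <name>"
            running.add(parts[1])
    return sorted(set(expected) - running)


# Illustrative sample only: the YARN daemons have not been started here
sample_output = """4321 NameNode
4456 DataNode
4678 SecondaryNameNode
5012 Jps
"""
print(missing_daemons(sample_output))  # ['NodeManager', 'ResourceManager']
```

On a real machine, the text can be captured with `subprocess.run(["jps"], capture_output=True, text=True).stdout`.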


Step 15: Finally! You can check the namenode at http://localhost:50070 and check YARN at http://localhost:8088


Step 16: You can run some of the MapReduce examples provided in share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar



Hadoop Installation Summary

  • Check the Java version
  • Set up ssh to localhost
  • Download the Hadoop software
  • Edit
  • Create some directories for daemons
  • Update the config files
  • Format the namenode directory
  • Start the daemons
  • Run a couple of test programs