Data Scientist

 

Wayfair, Boston USA

July'19 - Present

Part of the Merchandising and Tagging Team. Building automated modeling methods to support tagging of the company's 14+ million products.

 

Research Assistant

 

NORTH CAROLINA STATE UNIVERSITY, USA

August'15 - May'19

 

Worked in the RAISE Lab under the supervision of Dr. Timothy Menzies.

  • Brainstormed on industrial projects from LexisNexis (LN), IBM, and the Laboratory for Analytic Sciences.

  • Organized a data hackathon to collect open source data from research papers, and maintained the data at SEACRAFT (http://tiny.cc/seacraft).
  • Built predictive models to classify legal text documents as relevant or non-relevant. Investigated Information Retrieval (IR), Natural Language Processing (NLP), and text mining features. Researched methods for every stage of the knowledge discovery process. Collaborated with a team of 3 and communicated with LN experts through a private GitHub repository using an Agile development process. Code base of around 10,000 LOC. Reported results in the technical paper "The 'BigSE' Project: Lessons Learned from Validating Industrial Text Mining".
  • Evaluated and scaled supervised learning, incremental learning, and active learning methods on StackExchange websites (~60 GB of raw data). Found that LDA and TF-IDF features with an SVM classification model achieved the best performance.
  • In a separate project, performed cross-company transfer learning on private data features. [Code on GitHub].

    • The general idea is to share data with others without disclosing where it comes from. Collected and preprocessed phishing data from multiple sources. Features in one data source were mutated, and a subset of samples was shared to build a predictor, which was then used to predict on the other sources. Achieved better performance with an SVM (RBF kernel). A summary of the results and methods is available online.

  • Teaching Assistant for the CSC 510 Software Engineering course, Spring 2018

    • Mentored 17 teams of about 4 students each on different SE projects. Helped students with issues related to their software frameworks, architecture, data modeling, and choice of tools, and proposed alternative solutions to problems.

Skills

OS: LINUX, WINDOWS, ANDROID

 

DATABASES: MongoDB, MySQL 

 

TOOLS: Apache Spark, Hadoop, Weka, Docker, Vagrant, Ansible, LaTeX, GitHub

 

PACKAGES: scikit-learn, spaCy, Spark MLlib, NLTK, pandas, SciPy, NumPy, Matplotlib, Jupyter

Programming Languages

Python

Java

Scala

Shell Scripting

JavaScript

Data Scientist Intern

 

LUCIDWORKS INC, USA

May'18 - Aug'18

 

  • Mentored by Chao Han and the team. Improved the existing Question & Answer system by extracting better-suited textual features.

  • Used the Apache Tika parser to extract the Spark, Hadoop, Lucene, and Solr mailing lists, generating ~200K Q&A pairs to validate the extracted features.

  • Extracted 9 features, including part-of-speech tags, position of the answer span, and named entities. Reached 86% accuracy using an XGBoost model. The work was incorporated into the Fusion AI product.

IBM, RTP USA
May'17 - Aug'17
  • Part of the DevOps Insight Team under the guidance of Donald Cronin and mentored by Alexander Sobran.
  • Researched ways to provide insights into development practices, such as how teams collaborate, the rate at which issues, bugs, and enhancements are closed, the time taken to resolve them, the impact of hero programmers, and the impact of introducing continuous integration tools.

  • Analyzed 1,108 public and 538 enterprise GitHub repositories. Found that, contrary to open source principles, 80% of the code is written by only 20% of developers in 77% of projects. ARIMA models built on GitHub issue time series can accurately forecast future bugs and enhancements.

Research Intern

 

DURHAM UNIVERSITY, ENGLAND

June'14 - July'14

 

Summer internship at the Intelligent Imaging Innovative Computing Group under the supervision of Dr Toby Breckon.

Project title: Object Recognition using Visual Bag of Words and Principal Component Analysis

  • Developed an existing software implementation into an extended experimental suite. Carried out a scientific evaluation of the proposed Principal Component Analysis (PCA) approach using benchmark datasets for the task.

  • Used a Random Forest learner with Bag of Visual Words featurization. Code base of around 2,000 LOC. Image dataset of around 1 GB. Improved object recognition accuracy by 7-10%.
