Wayfair, Boston USA
August '19 - Present
Part of the Merchandising and Tagging team. Building automated modeling methods to support tagging of the company's 14+ million products.
Mentoring a new hire through the onboarding process and ramping them up on the Tagging service.
Upgraded the Tagging service to cover Boolean and Enumerated List schema tags (~14K of 26K) for 2 use cases: Catalog Cleanup and Product Addition. Corrected ~9.5 million revenue-driving schema tag values.
Organized non-work team outings to help colleagues bond and network.
OS: Linux, Windows, Android
DATABASES: MongoDB, MySQL
TOOLS: Apache Spark, Hadoop, Weka, Docker, Vagrant, Ansible, LaTeX, GitHub
PACKAGES: scikit-learn, spaCy, Spark MLlib, NLTK, pandas, SciPy, NumPy, Matplotlib, Jupyter
NORTH CAROLINA STATE UNIVERSITY, USA
August'15 - May'19
Worked on industrial projects from LexisNexis (LN), IBM, and the Laboratory for Analytic Sciences.
- Organized a Data Hackathon to collect open-source data from research papers and maintain the data at SEACRAFT (http://tiny.cc/seacraft).
- Created predictive models to classify legal text documents as relevant or non-relevant. Investigated Information Retrieval (IR), Natural Language Processing (NLP), and text-mining features, and researched the stages of the Knowledge Discovery in Databases process. Collaborated in a team of 3 and communicated with LN experts in a private GitHub repository, following an Agile development process; code base of around 10,000 LOC. Reported results in the technical paper "The 'BigSE' Project: Lessons Learned from Validating Industrial Text Mining".
- Evaluated and scaled supervised learning, incremental learning, and active learning methods on StackExchange websites (~60 GB of raw data). Found that LDA and TF-IDF features with an SVM classifier achieved the best performance.
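The TF-IDF featurization mentioned above can be sketched in plain Python. This is an illustrative toy example, not the project's actual pipeline; the corpus and function name are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a small tokenized corpus (toy sketch).

    tf  = term count / document length
    idf = log(number of docs / number of docs containing the term)
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return weights

# hypothetical mini-corpus of tokenized posts
docs = [["svm", "text", "mining"], ["svm", "lda"], ["text", "lda", "lda"]]
w = tfidf(docs)
```

Rare terms (here "mining", appearing in 1 of 3 docs) receive higher weights than common ones; these per-document vectors would then feed an SVM classifier.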
- In other work, performed cross-company transfer learning on private data features [code on GitHub]. The general idea is to share data with others without disclosing where it comes from. Collected and preprocessed phishing data from multiple sources; features in one data source were mutated, and a subset of samples was shared to build a predictor that was then evaluated on the other sources. Achieved the best performance with an SVM (RBF kernel). Results and methods are summarized online.
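The mutate-and-share scheme above can be sketched with two small helpers: perturb each feature value slightly, then release only a random subset of rows. This is a minimal illustration under assumed parameters (noise bound, share fraction); the function names and data are hypothetical.

```python
import random

def mutate_features(rows, noise=0.05, seed=42):
    """Privatize numeric feature rows by perturbing each value within +/- noise."""
    rng = random.Random(seed)
    return [[x + rng.uniform(-noise, noise) for x in row] for row in rows]

def share_subset(rows, frac=0.5, seed=42):
    """Release only a random fraction of the (already mutated) rows."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * frac))
    return rng.sample(rows, k)

# hypothetical feature rows from one data source
original = [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7]]
mutated = mutate_features(original)
shared = share_subset(mutated)   # this subset goes to the other parties
```

A predictor trained on `shared` can then be evaluated on data from the other sources, without the receiving party seeing the exact original values.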
Teaching Assistant for CSC 510 Software Engineering Course in Spring 2018
Mentored 17 teams of about 4 students each on different SE projects. Helped students with issues related to their software framework, architecture, data modeling, and tool choices, and proposed alternate solutions to problems.
Data Scientist Intern
LUCIDWORKS INC, USA
May'18 - Aug'18
Mentored by Chao Han and the team. Improved the existing Question & Answer system by extracting better-fitting textual features.
Used the Tika parser to extract the Spark, Hadoop, Lucene, and Solr mailing lists, generating ~200K Q&A pairs to validate the extracted features.
Extracted 9 features such as part-of-speech tags, answer-span position, and named entities. Reached 86% accuracy with an XGBoost model; the work was incorporated into the Fusion AI product.
IBM, RTP USA
May'17 - Aug'17
Part of the DevOps Insight team under the guidance of Donald Cronin, mentored by Alexander Sobran.
Researched insights into development practices, such as how teams collaborate, the rate at which issues/bugs/enhancements are closed, the time taken to resolve them, the impact of hero programmers, and the impact of introducing continuous integration tools.
Analyzed 1,108 public and 538 enterprise GitHub repositories. Found that, contrary to open-source principles, 80% of the code is written by only 20% of the developers in 77% of projects. ARIMA models built on GitHub issue time series can accurately forecast future bugs and enhancements.
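The 80/20 concentration check behind the hero-programmer finding can be sketched as: rank developers by commit count and measure the share of work done by the top 20%. A minimal sketch with made-up repository data (the function name and numbers are illustrative, not from the study):

```python
def top_contributor_share(commits_by_dev, top_frac=0.2):
    """Fraction of all commits made by the top `top_frac` of developers."""
    counts = sorted(commits_by_dev.values(), reverse=True)
    k = max(1, int(len(counts) * top_frac))   # at least one developer
    return sum(counts[:k]) / sum(counts)

# hypothetical repo: one "hero" developer does most of the work
repo = {"alice": 80, "bob": 5, "carol": 5, "dave": 5, "eve": 5}
share = top_contributor_share(repo)  # top 20% of 5 devs = 1 dev, 80 of 100 commits
```

A repository would count toward the reported 77% whenever this share reaches 0.8.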
DURHAM UNIVERSITY, ENGLAND
June'14 - July'14
Project title: Object Recognition using Visual Bag of Words and Principal Component Analysis
Extended an existing software implementation into a full experimental suite. Carried out a scientific evaluation of the proposed Principal Component Analysis (PCA) approach using benchmark datasets for the task.
The learner used was Random Forest, with featurization via Bag of Visual Words; code base of around 2,000 LOC and an image dataset of around 1 GB. Improved object recognition accuracy by 7-10%.