Agility, Patience, and Hadoop—a conversation with Eirini Spyropoulou on what it takes to be a data scientist in the industry

Eirini Spyropoulou currently works as a data scientist at Barclays [a]. During her PhD at the University of Bristol we collaborated on multi-relational pattern discovery.

Eirini Spyropoulou is one of the most uplifting and fun collaborators I have had the pleasure to work with: diligent, proficient, and yet pleasantly free of ego. Despite her young age, she has already gathered plenty of valuable experience on both sides of Data Science research, the academic as well as the industrial.

Before doing her PhD at the University of Bristol from 2009 to 2013, she worked as a professional software engineer for multiple companies. During her PhD she pushed the topic of subjectively interesting pattern discovery into the realm of multi-relational data [1, 2, 3]. After finishing, she put her freshly acquired scientific insights into practice by joining Toshiba Research as a research engineer. During this time she remained affiliated with her former research group in Bristol and in close touch with developments in the academic community. Finally, in 2015, after another period of full-time research [4, 5], Eirini moved to London to become a data scientist at Barclays [b]. In this interview, which was originally recorded via Skype in 2015, she shares her unique perspective on Data Science in industry and academia.

Question: Eirini, you moved from Bristol to London a few months back to start as a data scientist at Barclays bank. You told me that, by now, you really enjoy the city. What do you enjoy about it?

Eirini: I enjoy that I have access to everything. Everything happens in London basically: Meet-ups on every topic. People you can talk to on any topic. Even non work-related events. Exhibitions. Yeah. I have the feeling that everything happens here. This is really exciting.

Question: Your new job as a data scientist also sounds very exciting; it is one of those jobs advertised at conferences that people are keen on getting into. But what is it really about?

Eirini: It involves a number of things actually. Let’s say there is a new Data Science project. This usually does not mean that you have the data right away and you start working on it. In contrast to the big Web companies like Google or Facebook (where data is kept in massive data lakes), in the financial services industry the data is not as easily available. So the first thing data scientists work on is finding the right data sources and getting the approvals to get all the data.

Question: Do you also conceptualize new projects?

Eirini: New projects usually come from the business. What we do is manage the expectations and make sure that we define realistic requirements. This is usually done by business analysts or product managers but since Data Science is a new field, we need to be involved in this process.

Question: …and then the “real” work starts?

Eirini: Then the real work starts. At the beginning this involves a lot of interaction with the domain experts. Depending on the domain of the project this interaction needs to be more or less lengthy. In some domains data exploration and feature extraction are done in a continuous feedback loop with the experts.

Question: Alright, but at some point I suppose, the work will also reach a more technical level; like actually creating a statistical model based on the data.

Eirini: Yes. Before modeling though there is an additional step, which is data cleaning. Because you never have clean data and values will be missing—especially when you have data records that have been manually entered at some point. So we have to deal with these problems before we can apply any modeling algorithm.

“You can’t have people that know only one thing”

Question: Ok. So now you already mentioned quite a variety of activities: requirement analysis, definition of data sources and data gathering, data cleaning, finally modeling. Would a typical data scientist really be involved in all those steps equally or could someone dream of, say, having a career just based on data cleaning?

Eirini: No, you need to know everything. Of course if someone has done perhaps a cool PhD on data cleaning this could be useful, but you can’t have people that know only one thing.

Question: Can you talk a bit about the specific technologies that data scientists are using in the industry?

Eirini: Yes, we are using all of the open source technologies for Big Data: Hadoop, Spark, the ML library of Spark, all of the products in this ecosystem.

Question: Aha. So what languages does that involve? Java or Scala mostly?

Eirini: Well, at this point we need to be flexible because there isn’t one language that satisfies everything we need. Personally, I prefer to stay on the Python side of things and mainly use PySpark, but Scala fits better with the map reduce framework. The only problem is that, at the moment, not all Machine Learning algorithms are implemented in Spark’s MLlib. So if something is not there, with Scala you do not have the variety of algorithms that you have with Python.

Question: But then you would not be running the algorithm in a distributed environment?

Eirini: That’s true, but if you have managed with map reduce operations to bring your problem down to something smaller, then you can still use Python libraries if you use PySpark. But I guess this is something that in the future will not be a problem, because I know that the open source community is working very hard to enrich the MLlib of Spark more and more.
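The pattern Eirini describes here, shrinking the data with map and reduce steps and then handing the small result to ordinary Python code, can be sketched roughly as follows. This is only an illustration in plain Python rather than PySpark, so that it runs without a cluster; the partitions, the record layout, and the function names are all hypothetical and not taken from any project mentioned in the interview.

```python
from functools import reduce
from statistics import mean

# Hypothetical stand-in for a distributed dataset: each inner list plays
# the role of one partition holding (customer_id, amount) records.
partitions = [
    [("c1", 10.0), ("c2", 5.0), ("c1", 2.5)],
    [("c2", 7.5), ("c3", 1.0)],
    [("c1", 4.0), ("c3", 3.0), ("c3", 6.0)],
]


def map_partition(records):
    """Map step: summarise one partition as per-customer totals."""
    totals = {}
    for key, amount in records:
        totals[key] = totals.get(key, 0.0) + amount
    return totals


def merge(left, right):
    """Reduce step: combine two partial summaries into one."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0.0) + value
    return merged


# After the reduce step the data is a single small dictionary ...
totals_per_customer = reduce(merge, map(map_partition, partitions), {})

# ... so any single-machine Python library can take over from here.
average_total = mean(totals_per_customer.values())
```

In a real PySpark job the two helper functions would be replaced by the framework's own map and reduce operations; the point of the sketch is only that the final step works on data small enough to fit on one machine.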

“The PhD gives you the confidence that you can do everything”

Eirini teaching her audience about multi-relational pattern discovery. Her research showed that, while it might look deceptively easy to map all your data to one flat table, you might miss out on insights and/or make false discoveries unless you capture the full multi-relational nature of the world.

Question: So now we have established that a wide variety of skills is required for your job. Which of those skills did you actually acquire through doing a PhD?

Eirini: (Thinks…) That is a difficult one. I guess…

Question: The social skills!?

Eirini: No, not really (laughs). No, the PhD what it gives you is basically the confidence that you can do everything. For example in the beginning of a project, everything is fuzzy and you don’t know what model you are going to end up with. Handling such situations is what I got from the PhD.

Question: So if you could talk now to your former self who is just about to start a PhD, would you perhaps recommend changing direction or putting more emphasis on certain things?

Eirini: One thing I would do differently now is to do an internship in the industry during the PhD, because we (in academia) do not really understand how things are in the industry. For people who are not sure if they want to stay in academia or not after their PhD, it is important that they have worked on a real business problem and with real data.

Question: Is there a specific class of business problems that you could imagine now that are actually related to what you were doing? How far is multi-relational pattern mining from business cases?

Eirini: (Laughs…) Ok. Well I guess one can say the industry is a few steps behind what we are doing in research. I mean, it is not cutting edge research in the industry. What is interesting though is that I see a real need for exploratory data mining the way we have imagined it in academia.

Question: Tell me more, please.

Eirini: The first question that the stakeholders usually ask is “what patterns can you see in the data”. At the moment there is no tool that can help them do this over big data.

“I am surprised that there is still not so much research of how to take advantage of the map reduce framework and specifically Spark”

Question: Ok, but apart from exploratory analysis it sounds like for most practical problems you will find a research paper with the perfect answer. Is that really so or did you find any surprising gaps that research left open?

Eirini: The usual problem is how you do modeling with data that has both numerical and categorical features.

Question: Oh, interesting. I always find myself thinking that this is surprisingly hard to do properly.

Eirini: Yes! And this is what comes up all the time and I never have the perfect solution for it.

Question: I remember talking about this issue with Peter Flach some time back, and in his book, which I like a lot, there are some interesting approaches, for instance what he calls feature calibration [c], with which you can convert categorical into numerical features. The problem with that is, when you apply it, your problem can turn from linearly separable into inseparable.

Eirini: I see. I guess how well an approach works depends on how big the domain of the categorical feature is and what machine learning algorithm is used.
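To make the feature calibration idea from the question above concrete, here is a minimal sketch. It assumes the simplest possible estimate, raw relative frequencies without any smoothing, which may well differ from what the book proposes. The function names, the toy data, and the fallback to the overall positive rate for unseen values are illustrative assumptions only.

```python
from collections import defaultdict


def fit_calibration(values, labels):
    """Estimate P(label == 1 | value) for each categorical value
    by relative frequency (no smoothing; an assumption of this sketch)."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for value, label in zip(values, labels):
        counts[value] += 1
        positives[value] += label
    return {value: positives[value] / counts[value] for value in counts}


def calibrate(values, mapping, default):
    """Replace each categorical value with its estimated probability;
    values not seen during fitting get the supplied default."""
    return [mapping.get(value, default) for value in values]


# Hypothetical toy data: one categorical feature and binary labels.
colours = ["red", "red", "blue", "green", "blue", "red", "green", "blue"]
labels = [1, 1, 0, 0, 0, 0, 1, 1]

mapping = fit_calibration(colours, labels)
prior = sum(labels) / len(labels)  # overall positive rate, used as fallback
numeric_feature = calibrate(colours, mapping, prior)
```

With only two or three observations per value, as in this toy data, the estimates are very noisy, which is one way to read Eirini's remark that the outcome depends on how big the domain of the categorical feature is.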

Question: Right. So, apart from these exceptions, would you say that Data Mining and Machine Learning research are generally addressing the right topics?

Eirini: It is hard to say, but what I am surprised of is that there is still not so much research of how to take advantage of the map reduce framework and specifically Spark. It is possible that this is done at other conferences than the ones I usually go to [d].

Question: Perhaps that is more done in the Systems community?

Eirini: Yes, but there are different aspects as well. A new parallel algorithm for a known task could be part of a Data Mining conference. A new execution strategy for Spark would be a better fit for a Systems conference.

“The evaluation of exploratory data analysis—it needs real people”

Question: That makes sense. Moving from scientific content to soft factors: when you compare the overall feeling of being in the industry in contrast to an academic institution, what are the most striking differences?

Eirini: The most striking difference is the speed with which a project in the industry is turned from nothing to something. It’s incredible. I was not used to that, because usually what we do in academia is, we have a project idea, then we discuss it, and then we start thinking, we spend a lot of time thinking about it, what would be the best solution for this problem and so on. But in industry it is the other way around. You start doing stuff and you could well have a model in one day.

Question: And you learn a lot in that process I suppose?

Eirini: Yes, and it is more motivating in a way. At least for me, because I start seeing an outcome right away. And this keeps me wanting to do more. That is the main difference.

Question: So what could we say in defense of the academic approach then?

Eirini: Well the academic approach is more sound, right? In academia we always have a sound evaluation framework which we use to show that our method performs better than all the others. In the industry it is not about having the perfect model, at least not at the first stage of the project, but it is about having something there that works acceptably. And then you go on and refine it. This is the notion of agile project development [e].

Question: “Minimum viable product first; then continuous iterative improvement.” Ok, but wait. Focussing specifically on exploratory data analysis: what is the sound evaluation framework of academic research here?

Eirini: Ah, I see. That is a different thing. The evaluation of exploratory data analysis—it needs real people. That is the only thing that we do not do in academia. There are other evaluations that we can put into a paper to get it accepted, but, ultimately I guess it is user testing. And that is I guess the difference between industry and academia, because in industry you have the users right there. Even if you do not do proper A/B testing, they will at least tell you if they found it useful. There is a direct interaction with the end-user. They tell you what they do not like, what does not work, why it doesn’t work. And then you can go on and refine your method.

Question: Yes, and then you would in any case have a practical proof-of-concept, right? So is it fair to say that once you have an established theory, evaluation in academia is more sound, but when this is not the case, then we are also struggling?

Eirini: Indeed.

“Being in the real world makes you realize that there are a lot of different people out there and you have to have the patience to deal with all of these different personalities”

Original snapshot from our whiteboard in Bristol when developing a fast enumeration algorithm for multi-relational patterns. The solution we came up with together with Tijl De Bie makes use of the idea of divide-and-conquer fixpoint enumeration for closure operators [6].

Question: How do you perceive the overall atmosphere in both worlds; seen more from the social side? Where did you find more love, joy, and happiness?

Eirini: (Laughs…) The transition wasn’t easy, to say that.

Question: Why?

Eirini: Because most of my life before I was in academia. So being in the real world makes you realize that there are a lot of different people out there and you have to have the patience to deal with all of these different personalities. I don’t know. I got the feeling that in academia—I mean we are still different people—but we have a common denominator let’s say. It was a lot different in the real world and it took me some time to get used to.

Question: So you need more patience?

Eirini: You need more patience and you need to be more open minded about what people know, what they should know, how to interact with them.

Question: Do you receive this open-mindedness also from the other side?

Eirini: If you don’t find it in the other people then you have to be the one. I mean there needs to be someone in the whole network of interactions who is open-minded for this to work (laughs).

Question: What drives someone prototypical from the industry as opposed to someone prototypical from the academic world? What are their motivations?

Eirini: Yes, right there are different motivations. Usually a person from academia would be interested in solving the problem. Other people, depending on their position, would be interested in using your product or selling your product.

Question: I guess the stereotype would be that people in the industry are more materialistic. Did you find there is anything to this stereotype?

Eirini: I do not think that this has an effect on everyday working life. At least from what I have seen.

“I do hope that universities will get a bit more open-minded and employ some people from the industry to teach the new generation of data scientists.”

Question: Where do you think people feel more secure?

Eirini: It depends on the level that you are on in academia, right? On the post-doc level that I was on when I left academia, obviously industry is more secure. At least to me. But at a higher level I am not sure. And now if we talk not only about security but also about professional growth, then again I am not sure. After a certain point it requires a lot of skills in both worlds.

Question: How about yourself? Do you see any way back to academia?

Eirini: I don’t know. I keep wondering actually, because, when I started I thought this is just a break for a few years and then I go back to academia, but now I am wondering what actually makes me more happy. There is the practical side of course, because I don’t have the chance to publish as much now.

Question: So is the way back even possible?

Eirini: Well everything is possible (laughs). I do hope that universities will get a bit more open-minded and employ some people from the industry to teach the new generation of data scientists. It would be really great if there was more collaboration at least.

Question: Yes, but I suppose for this it is mostly the industry side that would have to change, right?

Eirini: Yes.

Question: And how realistic is that?

Eirini: Everything is new and things are changing. So I still have hopes. But for this to happen, and since data from the industry is usually very sensitive, it could take years. But I do know a few successes. I recently got to know that the University of Leeds established a new research center where they managed to have a huge infrastructure, and they got data both from the NHS (UK National Health Service) and from Sainsbury’s (supermarket chain in the UK). Which is great. So if these things start happening it would be really beneficial for both worlds.

Question: Eirini, thanks a lot for this interesting conversation.

Eirini: Thanks as well. I enjoyed it a lot.

References

[1] E. Spyropoulou, T. De Bie, and M. Boley, “Interesting pattern mining in multi-relational data,” Data Mining and Knowledge Discovery, vol. 28, iss. 3, pp. 808–849, 2014.

[2] E. Spyropoulou, T. De Bie, and M. Boley, “Mining interesting patterns in multi-relational data with n-ary relationships,” in International Conference on Discovery Science, 2013, pp. 217–232.

[3] K. Kontonasios, E. Spyropoulou, and T. De Bie, “Knowledge discovery interestingness measures based on unexpectedness,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, iss. 5, pp. 386–399, 2012.

[4] J. Lijffijt, E. Spyropoulou, B. Kang, and T. De Bie, “PN-RMiner: a generic framework for mining interesting structured relational patterns,” International Journal of Data Science and Analytics, vol. 1, iss. 1, pp. 61–76, 2016.

[5] M. van Leeuwen, T. De Bie, E. Spyropoulou, and C. Mesnage, “Subjective interestingness of subgraph patterns,” Machine Learning, pp. 1–35, 2016.

[6] M. Boley, T. Horváth, A. Poigné, and S. Wrobel, “Listing closed sets of strongly accessible set systems with applications to data mining,” Theoretical Computer Science, vol. 411, iss. 3, pp. 691–700, 2010.


a, b. All statements and opinions expressed in this interview reflect only Eirini’s personal point of view and are in no way endorsed by Barclays.
c. The idea of feature calibration, in a nutshell, is to replace a categorical value v with the posterior probability of the positive class conditioned on v.
d. Conferences where you are likely to meet Eirini are typically Knowledge Discovery conferences like KDD or ECML PKDD.
e. During the review process of this interview, Eirini stressed that using agile is still a new thing in Data Science and that she believes that adding a sound evaluation of models as part of the process is important and will become even more important in the future.
