
Integrating Natural Language Processing and Computer Vision into an Interactive Learning Platform

Benson Liu    Rithesh Rajasekar
[email protected]    [email protected]

Aditya Shukla    Adina Solomon    Lucy Xu
[email protected]    [email protected]    [email protected]

Jessica Kuleshov*
[email protected]

New Jersey’s Governor’s School of Engineering and Technology
July 21, 2020

*Corresponding Author

Abstract—Virtual learning platforms are digital interfaces that allow educators and learners to communicate and share resources that facilitate education. Currently, most of these lack resources to increase engagement among students. Barriers such as language, digital literacy, and limited features often make platforms difficult to use. The purpose of this research is to create a comprehensive learning platform by incorporating concepts of Natural Language Processing (NLP) and machine learning. The platform was developed with Python and APIs such as socket programming for server connections, OpenCV for visual machine learning, and Google Speech Recognition for auditory NLP. After integrating components into the platform, the features were rigorously tested for accuracy to assess the success of a more comprehensive learning environment.

I. INTRODUCTION

The trend towards virtual learning has become increasingly popular. According to the National Center for Education Statistics, 3.3 million students attended college exclusively online in 2018 [1]. Online platforms offer students access to resources that are often inaccessible due to geographic barriers or other socioeconomic factors. Additionally, online learning allows for a more tailored educational experience. Most studies on the efficacy of virtual learning have shown improvements in student understanding. A study conducted by Sh!ft eLearning found that students engaged in virtual learning have a knowledge retention rate of 25-60%, whereas face-to-face learning is around 8-10% [2]. When used effectively, online learning allows students to improve their performance by readily providing resources that are unique to the student's needs. The same study found that students were able to learn nearly five times more material than in traditional learning without increasing time spent. More recently, the global Coronavirus pandemic has disrupted 1.2 billion learners worldwide, forcing many educational institutions to adopt virtual platforms such as Zoom, Google Classroom, WebEx, or Canvas to replicate traditional educational models [3]-[5]. For these reasons, online learning is now a serious educational option for both private and public institutions.

Although the demand for learning platforms continues to grow, several concerns with usability and low digital literacy have made online learning a challenge. Most online platforms are restrictive, and students have difficulty following along with lectures, staying engaged, and facilitating discussions or asking questions. Furthermore, limited features make it hard to share resources and communicate on a single platform. For instance, Zoom and WebEx are popular platforms for communication, while Canvas and Google Classroom are used for resource sharing.

By far, one of the largest barriers is combating distractions for younger students. Many studies have shown that online learning for this age group is not as effective because of their inability to mediate distractions. The lack of features to engage students on learning platforms does little to resolve this issue. A study published in the Journal of Experimental Child Psychology investigating dual-modality learning exposed students to visual and auditory sources while reading. The research suggested that this method helped improve students' vocabulary [6]. Video conferencing platforms, if integrated with the right features, can take advantage of dual modality to create an enhanced virtual learning environment.


With rising demand in the short and long term, the online education industry is experimenting with adaptable learning platforms of the future. Despite the growing necessity for virtual learning, most contemporary products struggle to accommodate traditional teaching methods. This paper discusses how NLP and computer vision were implemented to tackle common issues with existing virtual learning platforms.

II. TECHNICAL BACKGROUND

Initial research was conducted on existing video conferencing features and the programming techniques used to implement them. After identifying features to enhance, research was conducted on possible implementations.

A. Socket Programming

Socket programming is frequently used to grant access to the World Wide Web with Hypertext Transfer Protocol (HTTP). It is also used with other communication platforms to request and receive information. This module gives multiple devices the ability to connect to a single network. A socket is an endpoint of a two-way communication pathway. The model starts with at least two sockets, a server and a client, but can incorporate multiple clients simultaneously without the knowledge of each individual's address. The server acts as a central hub with the ability to receive and distribute data between multiple clients through different network protocols. The client communicates directly with the server, requiring information about the address of the server to initiate the stream of communication. The client and server can communicate through a Local Area Network (LAN) if they reside in the same local network, or utilize a Wide Area Network (WAN) to communicate across multiple LANs, as shown in Figures 1 and 2 [7].
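A minimal sketch of this client-server model in Python is shown below (the host, port, and message are placeholder values, not the platform's actual code).

import socket

HOST, PORT = "0.0.0.0", 9999                  # placeholder address and port

def run_server():
    # The server is the central hub: it claims a port and waits for clients.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((HOST, PORT))                 # attach the listening socket to the port
    server.listen(1)                          # allow one pending connection
    conn, addr = server.accept()              # blocks until a client connects
    conn.sendall(b"hello from the server")    # distribute data to the client
    conn.close()

def run_client(server_ip):
    # The client must know the server's address to open the communication stream.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect((server_ip, PORT))
    print(client.recv(1024))                  # receive the server's message
    client.close()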

Fig. 1. Flow of data through a LAN network with local connections. Source: Adapted from [7].

Fig. 2. Flow of data through a WAN network. Source: Adapted from [7].

B. Encryption

Although socket programming provides an optimal method for the exchange of large quantities of data, as communication becomes more accessible over networks and the amount of data transferred increases, the threat of data breaches also increases. From simple searches on Google to calls on Zoom, data is constantly encrypted to protect sensitive information.

Encryption is a process in which a message is encoded and decoded so that only certain parties can access the data in a readable format. The decrypted data is referred to as the plaintext, while the encrypted data is referred to as the ciphertext. A message is encrypted and decrypted through the use of cryptographic keys that translate between plaintext and ciphertext. There are two types of encryption: symmetric and asymmetric. Symmetric encryption uses a single key for both encryption and decryption, which both parties must have access to [8]. On the other hand, asymmetric encryption uses separate public and private keys for encryption and decryption. Both methods include encryption algorithms that mathematically use a key to convert data into ciphertext. Although there are many different encryption algorithms, the Advanced Encryption Standard (AES) is commonly used with symmetric encryption. AES revolves around the idea of a substitution-permutation network, where certain outputs are based on direct substitution while others are formed by rearranging inputs. The number of rounds in AES encryption is based on the length of the key. For a general 128-bit key, there are ten rounds, each with a unique 128-bit encryption round key. Each round for a 128-bit key includes four steps. First, the 128 bits are consolidated as 16 bytes, or a 4 by 4 matrix, that undergoes substitution with a fixed table. Then, each row in the matrix is shifted to the left, and each column undergoes a mathematical computation. Lastly, all 128 bits are XORed with the round key (refer to Figure 5) before an output is generated. In order to decrypt the data, the rounds and individual sub-steps are performed in the reverse order with the same keys [9].
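As a minimal illustration of the symmetric model described above (a sketch using the PyCryptodome Crypto.Cipher package; the key, block mode, and message are placeholders, not the platform's configuration), the same 128-bit key both encrypts and decrypts:

from Crypto.Cipher import AES

key = b"0123456789abcdef"                  # 128-bit (16-byte) shared secret key
message = b"exactly 16 bytes"              # one 16-byte block of plaintext

cipher = AES.new(key, AES.MODE_ECB)        # simplest block mode, for illustration only
ciphertext = cipher.encrypt(message)       # unreadable without the key

decipher = AES.new(key, AES.MODE_ECB)      # the same key reverses the rounds
print(decipher.decrypt(ciphertext))        # b"exactly 16 bytes"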


Fig. 3. AES encryption architecture with different key sizes. Source: Adapted from [9].

Fig. 4. Stages of AES encryption in a single round with a 128-bit cipher key. Source: Adapted from [9].

Fig. 5. The logic behind XOR gates with 2 inputs. Source: Adapted from [10].

C. Computer Vision & Image Processing

For the purpose of this project, machine learning will refer to an application of Artificial Intelligence geared towards training systems to autonomously master and refine their ability to execute certain tasks. Computer vision draws heavily from machine learning to process visual (image and video) data. As humans, we have the ability to recognize and draw conclusions from our surroundings using vision and perception [11]. Computers replicate this phenomenon by employing algorithms for object identification, feature classification, and image segmentation. Using these trained models, computers can locate, track, and analyze objects [12].

To aid computer vision tasks, image processing applies certain transformations to refine raw input images so that they can be used for further analysis [13]. Image manipulation techniques such as filtering, feature sharpening, edge and color detection, and noise removal are performed to prepare data for the computer vision algorithms.

1) OpenCV: OpenCV is an open-source machine learning software library designed to optimize real-time computer vision processes. OpenCV was originally developed in C++ but is utilized as cross-platform software, compatible with popular languages including Python, C#, and Java. It has more than 2500 built-in algorithms that lay out a comprehensive infrastructure for developers to use [14]. Specifically, in its applications with Python, users are able to perform operations and analysis on image arrays with greater efficiency using modules such as NumPy and Matplotlib.

2) Images & Image Convolution: A digital image is an arrangement of pixels in a two-dimensional grid with a specified height and width. Each pixel stores color information as numerical data. Different color models such as RGB or CMYK require varying amounts of bits per pixel to represent the proper color. Grayscale pixels are typically 8 bits, while RGB pixels are typically 24 bits (3 bytes) or 48 bits (6 bytes). RGB pixels store separate red, green, and blue values which are combined to form a single color [15], [16]. To store more information about the pixel, with each additional byte, an additional value is added to the pixel or the range of pre-existing values is expanded.

Fig. 6. The more bits a pixel has, the greater the specificity and range of colors it can represent. Source: Adapted from [17].

As Figure 6 illustrates, more bits per pixel allow for greater specificity and a wider range of represented colors. These color values can be scaled down or converted with convolution techniques. Convolution is the mathematical operation of combining two arrays to create a resulting array; it is used to apply filters on images to alter their appearance. The convolution procedure incorporates three main components: the input image, the kernel, and the output image. A kernel is a smaller square matrix that acts as a filtering mask when layered over a specific region of an input image [18]. Each entry in the kernel is assigned a weight that is applied to the pixel under it. The new pixel values are generated by multiplying the kernel weights with the pixels in the corresponding positions below them, and then summing the totals together to return an output for the pixel directly below the origin, or center entry, of the kernel. Convolution is a linear form of image processing that relies on the weighted sum of neighboring pixels to generate output values.


Fig. 7. Kernel layering onto an input matrix to output a pixel value. Source: Adapted from [19].

The general equation of convolution is represented as:

g(x, y) = w * f(x, y) = \sum_{dx=-a}^{a} \sum_{dy=-b}^{b} w(dx, dy)\, f(x + dx, y + dy)

In this equation, g(x, y) is the filtered output image, f(x, y) is the input image, and w is the filter kernel. The total output value can also be normalized by dividing the output by the total kernel entry count to ensure that the pixel preserves its original brightness and its value stays within range. The convolution filter cannot be applied to edge pixels unless padding is applied, in which a set of fixed values, usually all 0's or 1's, creates a border around the image to retain the image's original dimensions [20]. The kernel shifts across the image, and the convolution process is repeated until all the pixel values in the input image have been converted into a separate output image.
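As a concrete check of this equation, a single output pixel can be computed by hand and compared against OpenCV's filter2D, which applies exactly this weighted-sum operation (a sketch with a made-up 3x3 image and a normalized averaging kernel):

import numpy as np
import cv2

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]], dtype=np.float32)        # toy input image f(x, y)

kernel = np.ones((3, 3), dtype=np.float32) / 9.0     # averaging kernel, normalized by entry count

# Slide the kernel across the image; BORDER_CONSTANT pads the edges with zeros
out = cv2.filter2D(img, -1, kernel, borderType=cv2.BORDER_CONSTANT)

# Center pixel: weighted sum of all nine neighbors = (1 + 2 + ... + 9) / 9 = 5
print(out[1, 1])                                     # 5.0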

3) Image Filtering Techniques: Depending on the arrangement of weights within a kernel, the effect that the mask has on the output image will vary. Different kernel convolutions have the ability to blur, sharpen, emboss, and outline features in images. Figure 8 shows several common kernel operations that can be applied to an image using the convolution method mentioned in the previous section.

Gaussian blur kernels are applied in order to smooth out noise impurities during the preprocessing phase. Gaussian blurring performs better than other blurring techniques because it yields finer results by using a normal distribution function represented as:

G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}

Here, x and y are the respective horizontal and vertical distances from the origin, and σ is the standard deviation. Rather than applying a uniform spread, Gaussian blurs put more emphasis on the center weights in the kernel [22]. As shown in Figure 9, this effect can be modelled using a 3D bell curve to illustrate the distribution of the values in a Gaussian blur kernel matrix. The values are more elevated towards the center, indicating they have a higher weight than the surrounding ones [23]. Additionally, larger kernels will have a greater blurring effect on the underlying image.

Fig. 8. Several kernel layouts and their visual effect after being overlaid across the initial picture of the house. Source: Adapted from [21].

Fig. 9. Normal distribution graph. Source: Adapted from [22].
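For concreteness, the kernel weights can be generated directly from the Gaussian function above and compared with OpenCV's built-in blur (a sketch; the kernel size and sigma are arbitrary choices, not values used in the project):

import numpy as np
import cv2

sigma, radius = 1.0, 1                                  # 3x3 kernel (radius 1 around the center)
ax = np.arange(-radius, radius + 1)
xx, yy = np.meshgrid(ax, ax)                            # x and y distances from the origin

# Evaluate G(x, y) = 1 / (2*pi*sigma^2) * exp(-(x^2 + y^2) / (2*sigma^2))
kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
kernel /= kernel.sum()                                  # normalize so brightness is preserved

print(kernel.round(3))                                  # the center weight is the largest

# OpenCV builds an equivalent kernel internally and applies it in one call
image = np.random.rand(64, 64).astype(np.float32)       # placeholder image
blurred = cv2.GaussianBlur(image, (3, 3), sigma)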

D. Natural Language Processing

NLP is a field of machine learning focused on converting human languages into data that a computer can recognize and extract information from. NLP allows a computer to understand human languages, like English and Spanish, by converting letters into numbers using advanced algorithms [24]. Speech recognition, the subset of NLP used in this project, includes voice search and context recognition [25].

Current speech recognition systems have three sets of conversions: sound to electrical signals, analog electrical signals to digital signals, and digital signals of audio to text. To convert the digital signal into text, most modern systems use the Hidden Markov Model (HMM), which assumes that a speech signal, when interpreted on a small timescale (like ten milliseconds), can be viewed as having constant properties. These small fragments of speech are converted into vectors known as cepstral coefficients and then matched to fundamental units of speech called phonemes. An algorithm is then applied to determine the closest word from the specific set of phonemes. Since phonemes vary based on speaker, these programs require intensive training in order to improve accuracy. Modern systems use neural networks to refine the audio signals before HMM recognition [26].

This project specifically uses two Python APIs for NLP: PyAudio and SpeechRecognition. PyAudio is an open-source library based on the PortAudio library in C/C++ that allows audio to be recorded and played easily [27], [28]. The SpeechRecognition library consists of multiple APIs, including Google Speech Recognition and GoogleTrans, which were used in this project. These provide features such as voice-to-text transcription and text-to-text translation.

III. EXPERIMENTAL PROCEDURE

This section documents the different challenges that arose while implementing the following features: speech-to-text subtitles, translation between English and Spanish, voice-driven polling using OpenCV, a screen sharing feature, and a web search feature. Motivations for choosing certain implementations over others are also discussed.

A. Socket Programming Implementation

Socket programming provides an efficient method to transmit data between a server and a number of clients. In order to create a virtual learning platform, the transmission of data needs to be fast (in terms of the processing rate) and reliable. The two primary types of protocols are Transmission Control Protocol (TCP) and User Datagram Protocol (UDP). TCP establishes a connection-oriented protocol used in TCP/IP networks, while UDP is a connectionless protocol. TCP therefore provides additional built-in error checking in order to confirm that the packets of data were sent appropriately. This allows TCP to implement a specific flow control mechanism that dictates the amount of data flowing from clients to servers and servers to clients. The data is sent through send-and-receive buffers to prevent an overwhelming amount of data from transmitting at a given moment. As a result, TCP is generally utilized in applications that require high reliability, such as the web (HTTP) and email. Therefore, TCP provides the optimal type of communication for a steady flow of data and ensures that the connection is maintained at all times during a call [29].

An initial implementation of socket programming with video conferencing was analyzed based on a two-person and four-person video conferencing platform. The program included a traditional client and server model in which each client sends their data to the central server before the server transmits the data to each individual client. The program also dealt with unique ports to transmit different types of data, such as the audio and the video [30]. However, this approach included an unnecessary step when dealing with only two clients, as socket communication can occur directly between the server (teacher) and the client (student). Moreover, using multiple ports increases the overall load on the computer, which causes lag and reduces the quality of transmission. The following code shows the basic binding of a server socket to a port and a client socket connecting to the server socket.

sock.bind((host, port))                       # server side: bind the listening socket to a port
self.sock.connect((self.addr, self.port))     # client side: connect to the server's address and port

In search of a more efficient method of communication between a single server and client, a video/audio chat project created by DBC Projects was adapted to create a base platform with direct communication between the server and client sockets. With this logic in the platform, the host, or server socket, waits for a connection to itself while the client seeks to connect to the server socket. The transmission of data through sockets for this project was centered around the idea of using only one port to send and receive data [31]. This resulted in better overall performance during testing. After building an initial framework for audio and video transmission, the audio contained noticeable lag. In order to mitigate the issue, experimentation with a lower sample rate (from 22050 to 8000 samples per second) and a lower number of frames per buffer (from 2048 to 1024) significantly reduced lag in the audio. Although this reduced the audio quality itself, the difference was not noticeable, and it allowed for a less intensive transmission. The code below packages all the information, including the frame size, image frame, audio frame, caption, and poll information, into byte format before it is sent through the socket.

frame_size = (4 + len(img_str) + len(self.audio_buff) + captionLength + pollLength).to_bytes(4, byteorder='big')

send_frame = frame_size + img_size + img_str + self.audio_buff + captionBytes + pollBytes
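The sample rate and buffer size described above correspond roughly to PyAudio stream parameters along these lines (a sketch; the variable names and stream settings other than the rate and buffer size are illustrative, not the platform's exact code):

import pyaudio

SAMPLE_RATE = 8000          # lowered from 22050 samples per second
FRAMES_PER_BUFFER = 1024    # lowered from 2048 frames per buffer

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16,        # 16-bit samples
                 channels=1,                    # mono is sufficient for speech
                 rate=SAMPLE_RATE,
                 input=True,
                 frames_per_buffer=FRAMES_PER_BUFFER)

audio_chunk = stream.read(FRAMES_PER_BUFFER)    # one buffer of raw audio bytes to transmit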

Two devices were initially connected through a LAN using their IPv4 addresses. This allowed for a simple setup by passing in the other device's IPv4 address. However, this was limited to within a single local network. In order to connect remotely within a WAN, port forwarding was utilized to allow data to flow to a certain port on a device from an outside source. This eliminated the need to take down all the firewalls, which protect a device by blocking unauthorized connections into a private network.

B. AES Encryption

In order to keep the information of students and teachers safe, encryption was used in the educational platform. Especially with minors, the connection between the teacher and the student needs to be secure at all times. Even if the data is intercepted, a third party should not be able to read it. Symmetric encryption was the best option for a voice conference so multiple users could encrypt and decrypt the data with a common key. Therefore, AES encryption with 128-bit keys was utilized, using the Crypto.Cipher package. In order for the server and client to appropriately encrypt the data on the sending side and decrypt the data on the receiving side, the encryption key must be identical on both ends. It is generated from a parameter, in this case the meeting ID. If the server and client both use the same meeting ID, they will have the same encryption and decryption key and can view the data received. However, if the meeting IDs do not match, then the encryption keys generated on the client and server side will be different. Similarly, if a third party attempts to access the data transferred between the sockets, they will only have the ciphertext, which cannot be read without the correct encryption key. The code below shows how the encryption and decryption keys are created based on the meeting ID entered by the user. The encryption and decryption objects are used before sending and after receiving the message.

self.encryptionObject = AES.new(self.meetingid.encode("utf-8"), AES.MODE_CFB, 'This is an IV456'.encode("utf-8"))

self.decryptionObject = AES.new(self.meetingid.encode("utf-8"), AES.MODE_CFB, 'This is an IV456'.encode("utf-8"))

send_frame_encryption = Gui.encryptionObject.encrypt(send_frame)
self.sock.sendall(send_frame_encryption)

d = self.conn.recv(recv_size)
d = Gui.decryptionObject.decrypt(d)

This platform provides end-to-end encryption that most video conferencing platforms do not have. For example, Zoom stores all the keys for data encryption in its cloud server. Therefore, it could theoretically decrypt users' personal data at any point [32]. However, this educational platform does not store encryption keys, so the user's personal data is only accessible during the call by the intended parties. This means students and teachers can be confident in using a platform where their security is a high priority.

C. Translations/Subtitles

In order to use voice commands, NLP was used to convert the user's speech into data the computer can interpret and return as a string of text. SpeechRecognition, a Python library equipped with the Google Web Speech API, was used for this project. Since accuracy is the chief concern of an NLP program, a method to adjust for ambient noise from the Recognizer class of SpeechRecognition was used to set a baseline for the background noise. The PyAudio library was utilized to easily connect to the user's default microphone. After preliminary tests, the speech recognition feature was integrated into the main server code. Using the recognize_google method of the Recognizer class, the user's words were transcribed into a text string which was then used for the subtitles, as shown in the code below.

r.adjust_for_ambient_noise(source, duration=0.5)
self.transText = r.recognize_google(audio).lower()

For foreign languages, a parameter in the recognize_google method can be used to specify the language the user is speaking in. This was made possible with the Google Speech Recognition API, which has recognition capabilities for over a hundred languages. If the user has specified Spanish as their primary language, the program then translates the Spanish text string into English and stores that information in the subtitle variable. This ensures that the other user will be receiving translated English subtitles. On the receiving end, English subtitles will be translated into Spanish if the user has specified that Spanish is their primary language. All translations occur on the Spanish speaker's side in order to decrease lag time. This prevents unnecessary translations if there are only English speakers on the video call. The code below transcribes Spanish audio into Spanish text before translating the Spanish text into English text.

real_audio = r.recognize_google(audio, language="es-ES")
string = str(real_audio)
self.transText = trans.translate(string, src='es', dest='en').text

D. Polling Feature

This section discusses the implementation of a completely voice-driven polling function through the use of voice commands and OpenCV for finger detection.

1) Voice Commands: Since the polling feature is enabled through voice commands, the program constantly searches the captions for certain keywords. Multiple keywords were chosen for their relevance to the topic of polling. However, the program did not always pick up the keywords. By including homophones like "done" and "dunn," mistranscriptions of the user's audio were accounted for in the voice commands. After ensuring that the program accurately recognized certain keywords, various methods were considered to obtain the required information from the user for the poll. Using the Google Text-to-Speech (gTTS) Python library, the computer reads prompts to the user. Using state functions, the program can identify different stages of the poll, including initiating the poll, creating the question, generating the options, and seeking an answer. The user's responses were edited to remove the keyword "done" that signaled completion. Finally, the responses for each part of the poll were graphically displayed on the UI for the student to view. The code below prompts the user for a question, which is later repeated with the answer options.

tts = gTTS(text="Say your question and use done to end", lang='en')
tts.save("speech.mp3")
playsound("speech.mp3")

2) Voting with OpenCV: After the poll is created, the program uses OpenCV to recognize the number of fingers a user holds up as an indicator of which option they are selecting. As the program processes the image frames streamed by the user's video camera, it is able to locate the hand's position within the frame and identify fingers using basic trigonometry and built-in OpenCV methods.

a) Experimentation with Hand Detection: One of the main focuses for hand detection was not only the implementation but also the optimization of the hand identification accuracy. The three main methods of hand detection that were developed were basic color filtering, facial detection-assisted color filtering, and background subtraction. To determine the optimal method, each one was assessed for reliability.

The first iteration relied solely on basic color filtering, in which color thresholding was used to differentiate the hand from the image background [33]. To improve the program's accuracy in different lighting, the image colorspace was first changed from OpenCV's standard BGR format to HSV formatting. HSV pixel values are split into three separate channels for hue, saturation, and value, where color occupies a single channel, unlike BGR, where color information is contained in all three channels. This conversion greatly simplifies the color range specification since HSV is able to adapt to mixed lighting environments with varying saturation and detect the hand as long as it is within the correct hue and color range.

The minimum and maximum boundaries for the threshold range were defined to be [0, 20, 70] and [20, 255, 255], respectively. Then, the OpenCV inRange() function converted the image array from HSV to a binary (black and white) image. The regions within the frame that did not fall inside the specified HSV skin color threshold were assigned a value of 0, while those that did were assigned a value of 255 [35]. The resulting binary mask displayed the filtered silhouette of the hand.
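A minimal sketch of this thresholding step with the boundaries above (the input file name is a placeholder; the project's code reads frames from the webcam instead):

import cv2
import numpy as np

frame = cv2.imread("frame.png")                        # placeholder for one webcam frame
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)           # convert from BGR to HSV

lower = np.array([0, 20, 70], dtype=np.uint8)          # minimum H, S, V for skin tones
upper = np.array([20, 255, 255], dtype=np.uint8)       # maximum H, S, V for skin tones

mask = cv2.inRange(hsv, lower, upper)                  # 255 inside the range, 0 outside
hand_only = cv2.bitwise_and(frame, frame, mask=mask)   # keep only the skin-colored pixels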

Fig. 10. 3D cylindrical representation of the HSV colorspace. Source: Adapted from [34].

For the next iteration, experimentation with a new method involved incorporating facial detection within the program, which would ideally identify the square region of the image containing the user's face, and then extract the color values from within that region to create a unique color range that is applied to the rest of the image to search for hands. To perform face detection, a pre-trained Haar Cascade frontal face model was accessed from the official OpenCV repository and passed into a cascade classifier function. This Haar feature-based approach was trained using positive and negative images to distinguish important edge and line features for facial tracking [36].
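A sketch of the face detection step (the cascade path shown uses the copy bundled with the opencv-python package, an assumption; the project loaded the file from the official OpenCV repository):

import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)     # pre-trained frontal face model

frame = cv2.imread("frame.png")                        # placeholder webcam frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)         # detection runs on grayscale

faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:                             # square region containing the face
    face_roi = frame[y:y + h, x:x + w]                 # pixels used to build the skin-color range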

Fig. 11. Haar features being applied to a female face to identify various facial structures. Source: Adapted from [37].

After locating the face, an image segmentation technique known as histogram back projection was used. The program generated a histogram that mapped out the range of hue and saturation values found in the detected face, which was then projected onto the target region where the user placed their hand to search for similar skin-color pixel values [38]. The shape of the resulting matches was displayed in binary format, and the colorspace was converted into grayscale so that the computational processing was less intensive for the rest of the procedure.
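A sketch of the back projection step, continuing from the face region found above (variable names are illustrative; face_roi and frame are assumed from the previous sketch):

import cv2

hsv_face = cv2.cvtColor(face_roi, cv2.COLOR_BGR2HSV)
hsv_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

# Histogram over the hue and saturation channels of the detected face
roi_hist = cv2.calcHist([hsv_face], [0, 1], None, [180, 256], [0, 180, 0, 256])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Project the face's color distribution onto the full frame to find similar skin pixels
back_proj = cv2.calcBackProject([hsv_frame], [0, 1], roi_hist, [0, 180, 0, 256], 1)
_, hand_mask = cv2.threshold(back_proj, 50, 255, cv2.THRESH_BINARY)   # binary match mask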

The last iteration used a Gaussian blur and background segmentation algorithm sourced from a GitHub repository. From the video stream, a static background frame was captured and then subtracted from each subsequent frame in order to extract the distinct foreground features while eliminating unnecessary background elements [39], [40]. Using this technique as a form of motion analysis enabled the program to detect when a hand was placed in the foreground of the frame. To adjust the sensitivity of this detection, the background thresholding value was increased to 90, which filtered out more noise.
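A simplified sketch of background subtraction by frame differencing (the sourced repository uses a dedicated background subtractor; this absdiff version only illustrates the idea, with the threshold of 90 mirroring the value mentioned above):

import cv2

cap = cv2.VideoCapture(0)
_, background = cap.read()                                   # static background captured first
background = cv2.GaussianBlur(cv2.cvtColor(background, cv2.COLOR_BGR2GRAY), (21, 21), 0)

_, frame = cap.read()                                        # a later frame with the hand present
gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)

diff = cv2.absdiff(background, gray)                         # foreground = what changed
_, foreground = cv2.threshold(diff, 90, 255, cv2.THRESH_BINARY)
cap.release()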


b) Convolution and Morphology Filtering: Once the hand was detected, the image was converted into binary format. Both convolutions and a form of non-linear filtering called morphological operators were applied to remove noise from the image and polish the hand silhouette, eliminating holes within the mask's boundaries [41]. Several iterative layers of either an erosion filter, a dilation filter, or both were applied to the frame [42]. These types of non-linear filters depend on the pixels' relative ordering rather than their numerical values. The erosion operation trims the boundary and slightly reduces the size of the silhouette, while the dilation operation has the opposite effect and expands it. After the white noise was removed, a basic Gaussian blur kernel was applied over the mask to refine it even further.
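A sketch of these cleanup operations on the binary mask (hand_mask stands in for the silhouette produced by the detection step; the kernel size and iteration counts are illustrative):

import cv2
import numpy as np

kernel = np.ones((5, 5), np.uint8)

eroded = cv2.erode(hand_mask, kernel, iterations=2)      # trim the boundary, removing white noise
dilated = cv2.dilate(eroded, kernel, iterations=2)       # expand it back, filling small holes
polished = cv2.GaussianBlur(dilated, (5, 5), 0)          # final smoothing pass over the mask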

Fig. 12. Side-by-side comparison of the hand mask with no filters vs. with erosion, dilation, and Gaussian filters applied.

c) Finger Tracking: To improve the overall accuracy of the readings, a small, cropped section of the frame was defined to be the region of interest for detection. Reducing the area of the frame that the program was processing mitigated potential noise interference from other parts of the background and specified a consistent region where the user could place their hand. Thus, the dimensions were reduced to a 50x50 pixel frame on the right side of the screen. To determine the number of fingers a user was holding up, a contour outline was created along the boundaries of the hand, and a convex hull outline was drawn to capture the general shape of the hand [43]. Next, the convexityDefects() function was used to identify the defects, which are points along the contour with large distance deviations in relation to the hull, representative of the gaps between fingers [44]. For each defect, start and end points, representative of adjacent fingertips, were defined along the hull, and the actual defect coordinate was chosen by finding the local minimum along the contour concavity.

Then, the distances between the points were calculated using the distance formula. The Law of Cosines was used to find the approximate angle of the opening for each gap to gauge whether it could be considered a valid defect, since the angle must be less than 90 degrees. The areas within the hull and contour were also calculated separately to generate an area ratio that was used as a threshold, when there was no detected defect, to determine whether zero or one finger was held up. Using this process, the number of fingers held up, other than in the zero- or one-finger case, is always one more than the number of defects.
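A sketch of this counting logic (polished stands in for the cleaned binary mask from the previous steps; thresholds and the handling of the zero/one-finger case are simplified):

import math
import cv2

contours, _ = cv2.findContours(polished, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contour = max(contours, key=cv2.contourArea)              # largest contour is taken to be the hand

hull = cv2.convexHull(contour, returnPoints=False)        # hull as indices into the contour
defects = cv2.convexityDefects(contour, hull)

finger_gaps = 0
if defects is not None:
    for i in range(defects.shape[0]):
        s, e, f, _ = defects[i, 0]
        start, end, far = contour[s][0], contour[e][0], contour[f][0]
        # Side lengths of the triangle formed by two fingertips and the gap point between them
        a = math.dist(start, end)
        b = math.dist(start, far)
        c = math.dist(end, far)
        # Law of Cosines: angle at the gap point; count the defect only if it is under 90 degrees
        cos_angle = max(-1.0, min(1.0, (b**2 + c**2 - a**2) / (2 * b * c)))
        if math.degrees(math.acos(cos_angle)) < 90:
            finger_gaps += 1

# Each valid gap separates two raised fingers, so the count is one more than the gap count
fingers = finger_gaps + 1 if finger_gaps > 0 else 1       # the area-ratio check for 0 vs. 1 is omitted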

Fig. 13. The figure above shows the convexity defects of the hand and the respective formulas used to calculate the angle between the fingers.

E. Polling Integration

Once the hand detection portion was fully functional, the next step was integrating it into the polling feature. Initially, the image frames of the student were processed on the receiving side (the teacher) to determine the number of fingers the student held up while the poll was active. When the teacher sent a message with the poll question and options, the student's side activated the necessary steps to prompt the user for an answer. This method did not involve sending the poll result from the student to the teacher through sockets. However, processing the image on the receiving end led to serious issues with the accuracy and reliability of the finger detection, as it was using individual frames that could lag. As a result, the image detection was transferred to the sending side (the student). Although this required additional data to be sent from the student to the teacher to convey the results of the poll, the image detection now used the student's webcam directly, which led to higher reliability. Once the teacher received an answer to the poll from the student, a popup appeared with the results of the poll.

F. Web Search

Most learning platforms do not include features to easily access data from the web related to lessons taught in class. However, this learning platform integrates the voice-to-text subtitles with the ability to search certain terms or ideas. By selecting a certain phrase or word in the subtitles at the bottom of the interface, the program uses the selected text to automatically open a new browser window with the specific search parameter previously determined by the user. This allows the user to quickly search important ideas throughout the lecture while remaining attentive. This easy-to-use feature eliminates confusion during the lecture, so the student can feel truly comfortable with the material covered.

search = 'https://www.google.com/search?q=' + searchPhrase
webbrowser.open(search, new=1)

G. Screen Sharing

Screen sharing is a crucial feature that allows the teacher to easily communicate with their students in class. This program allows the student to see the teacher's screen. When the screen sharing feature is enabled, the frames captured from the teacher's device are sent instead of the frames from the webcam. Teachers can display lecture notes or share websites, which creates a more interactive lecture.

screenCapture = array(PIL.ImageGrab.grab())
screenCapture2 = cv2.resize(screenCapture, dsize=(160, 120), interpolation=cv2.INTER_AREA)
preview = cv2.cvtColor(screenCapture2, cv2.COLOR_BGR2RGBA)

The above code uses the Pillow ImageGrab module in order to screen share by capturing the frames on an individual device.

IV. RESULTS

The outcome of this research was an interactive video conferencing platform. The following section demonstrates the usage of several features and presents the results of accuracy tests performed for speech and image recognition.

A. Remote Server Connection

Through the socket programming model, the application was able to connect devices both locally through a LAN and remotely through a WAN with port forwarding. Figure 14 shows two devices connecting remotely outside the same LAN. The application was converted to an executable file using PyInstaller, which could be distributed and run without manually downloading the required Python packages.

B. Web Searching

The web search feature was seamlessly integrated into the subtitle feature within the overall program. Highlighting a section of text in the subtitles leads to a Google search of the specific phrase.

Fig. 14. Two users connecting remotely

Fig. 15. Web search feature

C. Screen Sharing

The screen sharing feature is activated when either user clicks a button to toggle between displaying their screen and their webcam. The screen sharing feature can display on-screen notes and promote an engaging lecture-style class.

D. Accuracy Testing of Speech Recognition

Fig. 16. Screen share feature with notes.

To verify the functionality of the speech-to-text features, several accuracy tests were completed involving a diverse group of test subjects. After opening the application, the trial participants were given pre-written phrases that were phonetically diverse and were asked to speak into the microphone (either their personal headphones or the computer audio) to allow the program to transcribe the phrase using the NLP speech-to-text feature. The phrases transcribed by the feature were compared to the pre-written phrases. If the phrase was transcribed correctly, then the trial was recorded as a success. If there was any deviation in the transcribed phrase's wording, then the trial was denoted as a failure and the deviation was recorded. This process was repeated not only for a subset of native English speakers, but also for native and non-native Spanish speakers to verify compatibility with the translation feature. Seven different test subjects spoke in one of two languages (English and Spanish), and over 100 phrases were tested throughout the trials.

The data was processed with a point system. All correct answers received 1 point. The averages of this converted data set provided an accuracy for different subsets of test subjects. The results for each group can be seen in Figures 17, 18, and 19.

After combining all the trial data, the tests suggested that the usage of the audio features was successful. The accuracy for native English speakers was over 90%. However, the incorrect results were failures of the NLP program to differentiate between certain homophones and to correctly place apostrophes for plural possessives. Additionally, microphone usage increased the accuracy of results by almost twenty percent, since the audio was passed directly into the microphone's receiver in the headset. For speech recognition in Spanish, the accuracy rate was lower than for English, but still well over fifty percent. This discrepancy may be due to the tendency of Spanish speakers to talk fast, which can manifest as unclear speech. The results did not differ significantly between native and non-native Spanish speakers. Due to the way Spanish voice recognition was conducted, some words were exchanged with synonyms during the recognition process, but this did not change the meaning of the sentence. Additionally, tests showed that continuous speech was better recognized than fragments, because the adjustment for background noise caused the program to miss the first 0.5 seconds of recording while performing this noise adjustment.

Finally, two of the test subjects used a version of the voice recognition software that did not include an adjustment for ambient noise. All other conditions were kept the same as in the previous trials, and all background noises were natural. Without an adjustment for background noise, the accuracy rate dropped to 75%, and sometimes the program did not return any text transcription of the audio. When this occurred, the program could not accurately recognize any words due to the level of background noise. Notably, this phenomenon did not occur a single time when using the voice recognition program with the adjustment for ambient noise.

Fig. 17. Captions in Spanish

Fig. 18. Accuracy rate by Noise Adjustment

Fig. 19. Accuracy rate by Microphone Usage

Fig. 20. Accuracy rate by Speaker Type

E. Accuracy Testing of Hand Detection Methods

In order to assess the accuracy of detection, several factors had to be taken into consideration when evaluating the consistency of detection across different user environments. Some of the variables, including background and lighting, affected the overall accuracy. The detection capabilities were tested on individuals with different skin tones to verify uniformity. For each detection method, several trial participants were instructed to hold up either one, two, or three fingers for three seconds, record the value that the program returned, and indicate whether it fluctuated or not. This was performed at different times of day to observe how each method responded to different lighting. The results were compared to determine which method to permanently implement into the polling sequence.

The collected data was then scored using a point system. All correct finger detection attempts earned one point, while all incorrect or fluctuating readings were given no points. The averages were taken of this converted data set to obtain the accuracy for different subsets of test subjects. The results for each of these methods can be seen in Figures 21 and 22.

Fig. 21. Daytime vs Nighttime

Fig. 22. The figures above show the image and graphical comparisons of the accuracy of all three methods at two different times of day.

During the trials, the biggest issue with detection using simple color filtering was its inconsistency in picking up hand pixels in different lighting. A user working in a room with a background similar to their own skin tone would experience inaccuracies in readings, since the values of the background would also pass through the mask and interfere with finger detection. Detection on users with skin tones entirely outside the predetermined range would not work at all, since the HSV threshold range was only representative of a limited selection of skin complexions. For instance, this method worked relatively well in broad daylight, but when applied alongside an artificial light source in the evening, the program was unable to accurately detect hands without picking up excess noise from the background, and the accuracy rate fell to 13.29%.

Similar to the first method, facial detection-assisted color filtering also had difficulty sensing a hand when the user's hand and their background shared similar colors. Although it was able to discern the hand easily, since the color threshold was adjusted based on the face's skin tone, significant quantities of noise were picked up from the background, which caused fluctuations in the accuracy and made detection imprecise. The results were more consistent than basic color filtering, but detection worked correctly less than three-quarters of the time.

For the background subtraction method, users were asked to leave their hand up initially (to capture the background frames before detection occurred) and then remove it when prompted by the program, for detection on the negative silhouette. The results for background subtraction had the highest rates of accuracy in both lighting settings. Hand detection using this method worked consistently and had near-perfect accuracy as long as no other features in the background moved. This method was chosen for final integration into the program.

In order to maximize the background subtraction method's effectiveness within the program, another test was conducted to determine the timing of hand placement. The two options considered were having the user put their hand up during background capture and remove it before the detection period, so that its negative static silhouette is processed, or having them leave their hand down initially and then raise it after the background image was captured. To verify that it worked well in different background environments, the process was repeated by placing the hand against a plain background and then against a background with clutter. The user was asked to hold up either 1, 2, or 3 fingers multiple times and record whether the program returned the right number of fingers. Raising the hand before background capture was 10% more accurate than raising the hand after capture. Additionally, the background clutter did not significantly affect the accuracy, although detection against backgrounds with less clutter was slightly more accurate.

Fig. 23. Accuracy for hand raised after


Fig. 24. Accuracy for hand raised initially

Fig. 25. Poll feature in action

V. CONCLUSION

A. Summary

This platform strives to improve upon video conferencing platforms that are available for public use and address issues that are currently neglected. The final outcome was a telecommunications software product targeted towards the education industry. The code for the full program is available at https://github.com/joosthemoos/gset2020vdlp. The implementation of an intuitive user interface allows students of all ages to utilize this product. This is of great importance, as students are often less receptive to learning over a virtual platform. This is remedied through the application of voice-driven technology. Many people (including groups with certain learning disabilities) have trouble navigating commonly used video conferencing platforms, despite their simple UIs. However, verbal communication with the server enables these groups to still use features which would normally be a challenge to interact with. Moreover, the inclusion of object recognition allows students to simulate the classroom experience by involving physical interaction. Additionally, the live subtitles and the ability to display a transcription of the lecture enable all students to have a better understanding of class information. This also appeals to the ESL (English as a Second Language) community, with the option to translate English speech to Spanish. This could easily be adapted to multiple languages to address a wider audience.

B. Future Plans

This project was mainly used as a proof of concept for multiple novel features, including a completely voice-driven polling system. The platform could be expanded in the future. Most notably, the server currently handles two users due to the computing power of a single device. More users could be added if this program were hosted on a cloud-based server with a larger computational capacity. For the screen sharing feature, the presenter could transmit both the video and screen contents at the same time by increasing the amount of data sent. Lastly, emotion tracking would be an effective feature that could allow the professor to gauge students' understanding of the material, closely resembling the environment of an in-person class. This would require training a neural network to recognize both emotions and mental states like boredom and confusion. The aforementioned advancements would help create a more innovative educational platform.

ACKNOWLEDGMENTS

The authors of this paper gratefully thank the following: project mentor Jessica Kuleshov for her extensive experience; Residential Teaching Assistant Akila Saravanan for her invaluable support and efforts in coordination and communication; Research Coordinator Benjamin Lee and Head Counselor Rajas Karajgikar for their guidance and feedback; Program Director Ilene Rosen, Ed.D., and Associate Program Director Jean Patrick Antoine, without whom this project would not have been possible. Finally, the authors appreciate Rutgers University, Rutgers School of Engineering, and the State of New Jersey, as well as the generosity of the sponsors of the New Jersey Governor's School of Engineering and Technology: Lockheed Martin, the NJ Space Grant Consortium, and NJ GSET Alumni.

REFERENCES

[1] "Fast Facts: Distance learning (80)", Nces.ed.gov, 2020.
[2] K. Gutierrez, "Facts and Stats That Reveal The Power Of eLearning [Infographic]", Shiftelearning.com, 2020.
[3] C. Li and F. Lalani, "The COVID-19 pandemic has changed education forever. This is how", World Economic Forum, 2020.
[4] S. Loeb, "How Effective Is Online Learning? What the Research Does and Doesn't Tell Us", Education Week, 2020.
[5] L. Dignan, "Online learning gets its moment due to COVID-19 pandemic: Here's how education will change", ZDNet, 2020.
[6] A. Valentini, J. Ricketts, R. Pye and C. Houston-Price, "Listening while reading promotes word learning from stories", Journal of Experimental Child Psychology, 2018. [Accessed 18 July 2020].
[7] "CS 50 Software Design and Implementation: Lecture 19", Cs.dartmouth.edu, 2020.
[8] "What is Encryption & How Does It Work?", Medium, 2020.
[9] "Advanced Encryption Standard - Tutorialspoint", Tutorialspoint.com, 2020.
[10] "Exclusive-OR Gate Tutorial", Electronics Tutorials, 2020.
[11] I. Mihajlovic, "Everything You Ever Wanted To Know About Computer Vision. Here's A Look Why It's So Awesome.", Medium, 2020.
[12] "An Introductory Guide to Computer Vision", Tryolabs.com, 2020.
[13] "What's the Difference between Computer Vision, Image Processing and Machine Learning?", RSIP Vision, 2020.
[14] "About", Opencv.org, 2020.
[15] "What is a Pixel? - Ultimate Photo Tips", Ultimate Photo Tips, 2020.
[16] "Image Types", Harrisgeospatial.com, 2020.
[17] "Image Pixels — Shutha", Shutha.org, 2020. [Online]. Available: http://shutha.org/node/789. [Accessed: 18-Jul-2020].
[18] D. Searls, "Image Processing", Dsearls.org, 2020.
[19] "Convolutional Neural Network Basics", Rohanverma.net, 2020.
[20] "Padding (Machine Learning)", DeepAI, 2020.
[21] U. Sinha, "Convolutions: Image convolution examples - AI Shack", Aishack.in, 2020.
[22] "Spatial Filters - Gaussian Smoothing", Homepages.inf.ed.ac.uk, 2020.
[23] "Blurring images – Image Processing with Python", Datacarpentry.org, 2020.
[24] S. Bird, E. Klein and E. Loper, Natural Language Processing with Python. Beijing: O'Reilly, 2009.
[25] R. Shaikh, "Gentle Start to Natural Language Processing using Python", Medium, 2020.
[26] D. Amos, "The Ultimate Guide To Speech Recognition With Python – Real Python", Realpython.com, 2020.
[27] "PortAudio - an Open-Source Cross-Platform Audio API", Portaudio.com, 2020.
[28] J. Anderson, "Python Bindings: Calling C or C++ From Python – Real Python", Real Python, 2020.
[29] S. Doyle, "TCP vs. UDP: Understanding the Difference", Private Internet Access Blog, 2020.
[30] A. Kumar, "Video Conferencing Using Sockets in Python 3.", Medium, 2018.
[31] "TCP Audio/Video Chat — DBC Projects", Dbc-projects.net, 2020.
[32] L. Newman, "So Wait, How Encrypted Are Zoom Meetings Really?", Wired, 2020.
[33] "Sadaival/Hand-Gestures", GitHub, 2020.
[34] "OpenCV: Thresholding Operations using inRange", Docs.opencv.org, 2020.
[35] "OpenCV: Operations on arrays", Docs.opencv.org, 2020.
[36] M. Swain and D. Ballard, "Histogram Backprojection", TheAILearner, 2020.
[37] S. Kumar, "Haar cascade Face Identification", Medium, 2020.
[38] "opencv/opencv", GitHub, 2020.
[39] "lzane/Fingers-Detection-using-OpenCV-and-Python", GitHub, 2020.
[40] "OpenCV: Motion Analysis", Docs.opencv.org, 2020.
[41] "Morphology Filters", Theobjects.com, 2020.
[42] "Morphological Transformations — OpenCV-Python Tutorials 1 documentation", Opencv-python-tutroals.readthedocs.io, 2020.
[43] "OpenCV: Contours : Getting Started", Docs.opencv.org, 2020.
[44] "OpenCV: Contours : More Functions", Docs.opencv.org, 2020.
